Remove header and footer from a PDF when the PDF template cannot be predicted

Hi,

I need to remove the header and footer from a PDF.

The PDFs are submitted from external sources, so we can't predict their template.
I tried using Aspose.Pdf.Facades.StampInfo, but it did not work.
In our case we can't predict the header/footer region, so we can't use Aspose.Pdf.Rectangle
to remove it. Is there any other way to remove the header/footer?

I have another idea, but I haven't found a way to implement it.
Since the header/footer is the same on all pages, we could
compare the text on consecutive pages and, if the line number and text match,
remove that text.
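The comparison idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an Aspose API: the class `HeaderFooterCandidates`, the method `Find` and the parameter `maxDepth` are hypothetical names, and the input is assumed to be the text of each page already extracted and split into non-empty lines, top to bottom.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.PDF.
public static class HeaderFooterCandidates
{
    // pages[i] holds the non-empty text lines of page i, top to bottom.
    // Returns lines that have identical text at the same offset from the top
    // (header candidates) or from the bottom (footer candidates) of two
    // consecutive pages. maxDepth limits how many lines from each edge are checked.
    public static HashSet<string> Find(
        IReadOnlyList<IReadOnlyList<string>> pages, int maxDepth = 3)
    {
        var candidates = new HashSet<string>();
        for (int p = 0; p + 1 < pages.Count; p++)
        {
            var a = pages[p];
            var b = pages[p + 1];
            int depth = Math.Min(maxDepth, Math.Min(a.Count, b.Count));
            for (int i = 0; i < depth; i++)
            {
                // Same line number counted from the top of the page.
                if (a[i] == b[i]) candidates.Add(a[i]);

                // Same line number counted from the bottom of the page.
                string fromBottomA = a[a.Count - 1 - i];
                string fromBottomB = b[b.Count - 1 - i];
                if (fromBottomA == fromBottomB) candidates.Add(fromBottomA);
            }
        }
        return candidates;
    }
}
```

Note that an exact string comparison will not match footers that change per page, such as "Page 1" and "Page 2"; those would need a looser comparison.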

@afsal.akbarsha,

Could you please share the source file so that we can investigate further and help you out?

@Adnan.Ahmad

Please check the sample code below, which finds duplicate text.

// GET TEXT ON EACH LINE
// Assumes textFragmentAbsorber is a TextFragmentAbsorber created earlier.
asposePDFDoc.Pages.Accept(textFragmentAbsorber);
_textFragmentAbsorber = textFragmentAbsorber;
var pageTextCollectionOnEachLine = textFragmentAbsorber?.Text?.Replace('\r', '\n')?.Split('\n')?.ToList() ?? new List<string>();
pageTextCollectionOnEachLine.RemoveAll(x => x.Length <= 0);

// GET DUPLICATE LINES
var duplicateLines = pageTextCollectionOnEachLine.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key);

Is there any method to remove the text in these duplicate lines?

@afsal.akbarsha,

I would like to inform you that we have functionality to extract and update text. Could you please share the source PDF, with an explanation of which text should be removed and why?

header_footer_Outsider.pdf (820.5 KB)

@Adnan.Ahmad

I have attached a sample PDF.

Our requirement:
if a PDF contains a header/footer, delete the header and footer from the PDF.

I tried using Aspose.Pdf.Facades.StampInfo, but it did not work.
In our case we can't predict the header/footer region, so we can't use Aspose.Pdf.Rectangle
to remove it. Is there any other way to remove the header/footer?

I have another idea, but I haven't found a way to implement it,
i.e. get the duplicate lines from the PDF and delete them.

Please check the sample code below, which finds duplicate text.

// GET TEXT ON EACH LINE
// Assumes textFragmentAbsorber is a TextFragmentAbsorber created earlier.
asposePDFDoc.Pages.Accept(textFragmentAbsorber);
_textFragmentAbsorber = textFragmentAbsorber;
var pageTextCollectionOnEachLine = textFragmentAbsorber?.Text?.Replace('\r', '\n')?.Split('\n')?.ToList() ?? new List<string>();
pageTextCollectionOnEachLine.RemoveAll(x => x.Length <= 0);

// GET DUPLICATE LINES
var duplicateLines = pageTextCollectionOnEachLine.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key);

Is there any method to remove the text in these duplicate lines?

@afsal.akbarsha

Thanks for getting back to us with the required information.

Please note that a header and footer are defined at the time of PDF generation; once the PDF is generated, they become part of its content as simple text. In other words, an existing PDF document has no specific definition of a header/footer by which they can be identified. Since you have already mentioned that you cannot specify a rectangle to trim from the PDF pages, that approach would not be suitable for you either.

We regret to inform you that this workaround is also not reliable, as it would not work with all of the PDF files you have. For example, the word “Cricket” is also mentioned on multiple pages in the same line. To investigate the scenario further, we have logged an investigation ticket as PDFNET-47512 in our issue tracking system. We will look into the details of the scenario and let you know as soon as we find a feasible way to achieve your requirements. Please allow us some time.

We are sorry for the inconvenience.
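One way to reduce false positives such as the "Cricket" example is to keep only the lines that recur on most pages, rather than any line that appears more than once. The following is a rough sketch under assumptions, not an Aspose feature: `RepeatedLines`, `OnMostPages` and the 0.8 default for `minPageShare` are hypothetical, and the per-page line lists are assumed to have been extracted beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.PDF.
public static class RepeatedLines
{
    // pages[i] holds the non-empty text lines of page i.
    // Returns the lines that occur on at least minPageShare of all pages.
    public static List<string> OnMostPages(
        IReadOnlyList<IReadOnlyList<string>> pages, double minPageShare = 0.8)
    {
        if (pages.Count == 0) return new List<string>();

        return pages
            .SelectMany(lines => lines.Distinct())  // count a line once per page
            .GroupBy(line => line)
            .Where(g => g.Count() >= minPageShare * pages.Count)
            .Select(g => g.Key)
            .ToList();
    }
}
```

The threshold is a judgment call: a lower value also catches headers that are missing from a cover page, while a higher value is less likely to remove body text.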

@Adnan.Ahmad

We don't need to remove text word by word.
We are planning to remove complete lines that have the same text,
so please ignore repeated words like "Cricket".

I can find the duplicate lines in the PDF using the code below.

// GET TEXT ON EACH LINE
// Assumes textFragmentAbsorber is a TextFragmentAbsorber created earlier.
asposePDFDoc.Pages.Accept(textFragmentAbsorber);
_textFragmentAbsorber = textFragmentAbsorber;
var pageTextCollectionOnEachLine = textFragmentAbsorber?.Text?.Replace('\r', '\n')?.Split('\n')?.ToList() ?? new List<string>();
pageTextCollectionOnEachLine.RemoveAll(x => x.Length <= 0);

// GET DUPLICATE LINE COLLECTION
var duplicateLineCollection = pageTextCollectionOnEachLine.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key);

Please try this code to find the duplicate lines that have the same text.
Use the attached PDF to find the lines with duplicate text
(mostly it will contain only the header and footer).

I just need to remove duplicateLineCollection from the PDF.
duplicateLineCollection mostly contains only the header and footer,
so we need to remove the text in these lines.
Even if it contains other lines with the same text, that is not a problem for us.
So please suggest a way to remove this duplicateLineCollection.

new_updated_header and footer__.pdf (999.1 KB)

@afsal.akbarsha

You may use the following code snippet to remove the text in duplicateLineCollection:

foreach (var s in duplicateLineCollection)
{
    if (!String.IsNullOrWhiteSpace(s))
    {
        // Search the whole document for fragments matching the duplicate line
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(s);
        asposePDFDoc.Pages.Accept(absorber);
        foreach (var text in absorber.TextFragments)
        {
            // Blank out each matching fragment
            text.Text = String.Empty;
        }
    }
}
asposePDFDoc.Save(dataDir + "updated.pdf");

You may also modify the above code snippet as per your needs. In case you need further assistance, please feel free to let us know.

@Adnan.Ahmad

I have already tried the following code to remove the text,
but the text absorber didn't find any string. Please suggest some other way.

Please find the attachment:
notworking.PNG (21.3 KB)

@afsal.akbarsha

Thanks for getting back to us.

This was not the case on our side while testing with Aspose.PDF for .NET 19.12. An output PDF is attached for your reference. Would you kindly share a sample console application that replicates the issue you are facing? We will then proceed to assist you accordingly.

updated.pdf (1002.9 KB)