Remove "Footer"


#1

I’ve got a series of PDFs that I’m trying to process…

Generally speaking, we use the text extractor to pull the raw text from the file:

    internal static StringBuilder GetStringBuilder(this string pdf)
    {
        var textAbsorber = new TextAbsorber();
        textAbsorber.ExtractionOptions.FormattingMode
            = TextExtractionOptions.TextFormattingMode.Pure;

        var pdfinfo = new FileInfo(pdf);
        var pdfDocument = new Document(pdfinfo.FullName);
        pdfDocument.Pages.Accept(textAbsorber);
        var pdftext = textAbsorber.GetStringBuilder();
        return pdftext;
    }

Now, normally this is fine… but the batch of files I’m working on includes a “header/footer” block that appears at the top… at the bottom… and, most importantly… in the middle as a watermark.

sample Pictures.jpg (117.7 KB)

In the attached image, you can see that there is a text block at the “end” of the page: one block at the top, two at the bottom, and “Copy of Original” large and in charge in the middle of the document.

I’m using this code block to search for and remove a fragment - but it only catches the “first” of the six in that final block:

    internal static Document DeleteMachineReadableCode(this Document pdfDocument)
    {
        var startEnd = $@"Copy of Electronic Original.*$";
        var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
        var textSearchOptions = new TextSearchOptions(true);
        textFragmentAbsorber.TextSearchOptions = textSearchOptions;

        pdfDocument.Pages.Accept(textFragmentAbsorber);

        var textFragmentCollection = textFragmentAbsorber.TextFragments;
        var count = textFragmentCollection.Count;
        foreach (TextFragment textFragment in textFragmentCollection)
            textFragment.Text = string.Empty;

        return pdfDocument;
    }
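
One thing that may be worth ruling out here (an assumption: nothing in this thread says how Aspose.PDF applies the pattern to the page text): in plain .NET, `$` without `RegexOptions.Multiline` only matches at the end of the whole input, so a pattern ending in `.*$` can match on the last line only. The `AnchorDemo` class below is a hypothetical illustration that uses only `System.Text.RegularExpressions`; it does not call Aspose.PDF.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class AnchorDemo
{
    // Counts matches of the pattern from the post above in plain .NET,
    // with or without RegexOptions.Multiline.
    public static int CountMatches(string text, bool multiline)
    {
        var options = multiline ? RegexOptions.Multiline : RegexOptions.None;
        return Regex.Matches(text, @"Copy of Electronic Original.*$", options).Count;
    }
}
```

For an input with the phrase on two separate lines, such as `"a\nCopy of Electronic Original 1\nb\nCopy of Electronic Original 2"`, `CountMatches(text, false)` finds only the occurrence on the last line, while `CountMatches(text, true)` finds both.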

Can I remove a parent text fragment that contains the text fragment I’m searching for?


#2

@wernercd

Thanks for contacting support.

Would you please share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly?


#3

Sorry, was hoping the image would be enough… but this should be a good representation of the file.

DEC-Loan Document - 00e9d30f-79bd-49ef-8b83-d0ee63039aea_037.pdf (89.4 KB)


#4

@wernercd

Thanks for sharing the sample PDF document.

Please note that the API extracts text fragments from the PDF in the same way they were added to it. In your particular scenario, you can add some custom checks after extracting the complete text of the PDF document and remove the desired text fragments. Please check the following code snippet and the attached PDF output, which we generated in our environment:

    var startEnd = ".+"; // match any non-empty run of text
    var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
    var textSearchOptions = new TextSearchOptions(true); // true = treat the pattern as a regular expression
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    // dataDir is not defined in this snippet; it is the folder that holds the PDF.
    Document pdfDocument = new Document(dataDir + "DEC-Loan Document.pdf");
    pdfDocument.Pages.Accept(textFragmentAbsorber);

    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        var text = textFragment.Text.ToLower();
        // Blank any fragment that contains one of the watermark words.
        // These are substring checks, so "of" also matches words such as "office".
        if (text.Contains("copy") || text.Contains("original") || text.Contains("of"))
            textFragment.Text = string.Empty;
    }
    pdfDocument.Save(dataDir + "test18.12.out.pdf");

test18.12.out.pdf (88.7 KB)
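
Because `Contains("of")` is a substring check, the filter above would also blank unrelated fragments such as "Office of the Clerk". A stricter check is sketched below, assuming the watermark fragments consist only of the watermark words. `WatermarkFilter` and `IsOnlyWatermarkWords` are hypothetical names, not part of Aspose.PDF, and the word list is passed in by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class WatermarkFilter
{
    // Returns true only when every whitespace-separated token in the text
    // is one of the given words (compared case-insensitively).
    public static bool IsOnlyWatermarkWords(string text, params string[] words)
    {
        if (string.IsNullOrWhiteSpace(text))
            return false;

        var allowed = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);

        // A null separator array splits on any whitespace.
        var tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return tokens.All(token => allowed.Contains(token));
    }
}
```

Inside the `foreach` loop, the condition could then read `if (WatermarkFilter.IsOnlyWatermarkWords(textFragment.Text, "copy", "of", "original"))`. Tokens that carry punctuation (for example "Original.") would not match, so the word list may need adjusting for the actual document.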

If the above code snippet does not meet your requirements, or you face any other issue, please feel free to let us know.