Remove "Footer"

I’ve got a series of PDFs that I’m trying to process…

Generally speaking, we use the text extractor to pull the raw text from the file:

    internal static StringBuilder GetStringBuilder(this string pdf)
    {
        var textAbsorber = new TextAbsorber();
        textAbsorber.ExtractionOptions.FormattingMode
            = TextExtractionOptions.TextFormattingMode.Pure;

        var pdfinfo = new FileInfo(pdf);
        var pdfDocument = new Document(pdfinfo.FullName);
        pdfDocument.Pages.Accept(textAbsorber);
        var pdftext = textAbsorber.GetStringBuilder();
        return pdftext;
    }

Now, normally this is fine… but the batch of files I’m working on includes a “header/footer” block that appears at the top… at the bottom… and, most importantly… in the middle as a watermark.

sample Pictures.jpg (117.7 KB)

In the attached image, you can see that at the “end” of the page is a text block: one block at the top, two at the bottom, and “Copy of Original” large and in charge in the middle of the document.

I’m using this code block to search for and remove a fragment - but it only catches the “first” of the six in that final block:

    internal static Document DeleteMachineReadableCode(this Document pdfDocument)
    {
        var startEnd = $@"Copy of Electronic Original.*$";
        var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
        var textSearchOptions = new TextSearchOptions(true);
        textFragmentAbsorber.TextSearchOptions = textSearchOptions;

        pdfDocument.Pages.Accept(textFragmentAbsorber);

        var textFragmentCollection = textFragmentAbsorber.TextFragments;
        var count = textFragmentCollection.Count;
        foreach (TextFragment textFragment in textFragmentCollection)
            textFragment.Text = string.Empty;

        return pdfDocument;
    }

Can I remove a parent text fragment that contains the text fragment I’m searching for?

@wernercd

Thanks for contacting support.

Would you please share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly?

Sorry, was hoping the image would be enough… but this should be a good representation of the file.

DEC-Loan Document - 00e9d30f-79bd-49ef-8b83-d0ee63039aea_037.pdf (89.4 KB)

@wernercd

Thanks for sharing sample PDF document.

Please note that the API extracts text fragments from the PDF in the same way in which they were added. In your particular scenario, however, you can extract all the text fragments of the PDF document, apply some custom checks, and remove the desired fragments. Please check the following code snippet and the attached PDF output, which we generated in our environment:

var startEnd = ".+";
var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
Document pdfDocument = new Document(dataDir + "DEC-Loan Document.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber);

var textFragmentCollection = textFragmentAbsorber.TextFragments;
var count = textFragmentCollection.Count;
foreach (TextFragment textFragment in textFragmentCollection)
{
  if (textFragment.Text.ToLower().Contains("copy") || textFragment.Text.ToLower().Contains("original") || textFragment.Text.ToLower().Contains("of"))
    textFragment.Text = string.Empty;
}
pdfDocument.Save(dataDir + "test18.12.out.pdf");

test18.12.out.pdf (88.7 KB)

In case above code snippet does not fulfill your requirements or you face any other issue, please feel free to let us know.

The main issue I have, after some further research, is that this seems to be painfully slow. I’m running this against the larger file that the test file is based on; the fragment count is in the thousands and it takes minutes per file.

Is there any way to speed up the process?

            var folder = $@"D:\ALL_FILES";
            if (!Directory.Exists(folder))
                throw new DirectoryNotFoundException(folder); // a missing folder is not an access problem

            var search = "*.pdf";
            var files = Directory.GetFiles(folder, search).Where(x => !x.EndsWith("-wow.pdf")).ToList();

            _logger.Info($"Running WOW! Converter on {files.Count} file(s)...");
            foreach (var file in files)
            {
                var fi = new FileInfo(file);
                var fiw = new FileInfo(file.Replace(".pdf", "-wow.pdf"));
                if (fiw.Exists)
                {
                    _logger.Info($"          {fiw.Name} exists, skipping...");
                    continue;
                }
                else
                {
                    _logger.Info($"          {fi.Name} Processing...");
                }

                var dir = fi.Directory.FullName;
                
                var startEnd = ".+";
                var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
                var textSearchOptions = new TextSearchOptions(true);
                textFragmentAbsorber.TextSearchOptions = textSearchOptions;
                
                _logger.Info($"          New Document...");
                Document pdfDocument = new Document(fi.FullName);

                _logger.Info($"          Accept Absorber...");
                pdfDocument.Pages.Accept(textFragmentAbsorber);
                var textFragmentCollection = textFragmentAbsorber.TextFragments;
                var count = textFragmentCollection.Count;

                _logger.Info($"          Foreach Fragment ({count})...");
                for (var index = 1; index <= textFragmentCollection.Count; index++)
                {
                    TextFragment textFragment = textFragmentCollection[index];
                    if (index % 100 == 0)
                        _logger.Debug($"               Foreach Fragment (#{index})...");
                    if (textFragment.Text.ToLower().Contains("copy") ||
                        textFragment.Text.ToLower().Contains("original") ||
                        textFragment.Text.ToLower().Contains("of"))
                        textFragment.Text = string.Empty;
                }

                _logger.Info($"          Save WOW! pdf...");
                pdfDocument.Save(fiw.FullName);
            }
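
A side note on the output file name in the loop above: `file.Replace(".pdf", "-wow.pdf")` is case-sensitive and replaces every occurrence of `.pdf` in the path, so a file named `SCAN.PDF` would map onto itself and be skipped as "already exists". Below is a minimal sketch of one alternative that uses only `System.IO`. The names `WowPaths`, `GetOutputPath` and `IsOutputFile` are hypothetical and not part of the original code.

```csharp
using System;
using System.IO;

public static class WowPaths
{
    // Hypothetical helper: builds "<name>-wow.pdf" next to the input file,
    // regardless of the case of the original extension.
    public static string GetOutputPath(string inputPath)
    {
        var dir = Path.GetDirectoryName(inputPath) ?? string.Empty;
        var name = Path.GetFileNameWithoutExtension(inputPath);
        return Path.Combine(dir, name + "-wow.pdf");
    }

    // Hypothetical helper: true for files that were already produced by the converter.
    public static bool IsOutputFile(string path) =>
        path.EndsWith("-wow.pdf", StringComparison.OrdinalIgnoreCase);
}
```

In the loop this would be used as `var fiw = new FileInfo(WowPaths.GetOutputPath(file));`, with `IsOutputFile` taking the place of the `EndsWith("-wow.pdf")` filter.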

@wernercd

Thanks for getting back to us.

In order to reduce memory consumption and speed things up, you can extract text using per-page processing and manually call Dispose on the processed page objects, as follows:

TextFragmentAbsorber absorber = new TextFragmentAbsorber();
using (var doc = new Aspose.Pdf.Document(myDir + "input.pdf"))
{
 foreach (Page page in doc.Pages)
 {
  page.Accept(absorber);
  page.Dispose();
 }
}

In case this does not help, please share your sample PDF document with which you are facing slow performance of the API. We will test the scenario in our environment and address it accordingly.

I switched to a method that’s fast enough for my needs:

Instead of searching for words inside the text, I search for fragments whose whole text matches the COPY OF ORIGINAL parts…
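
To illustrate the difference between the two checks, here is a small sketch in plain C# with no Aspose types involved. `WatermarkFilter` and its members are hypothetical names. The substring check mirrors the earlier `Contains` loop, and the exact check mirrors the `switch` further down, under the assumption that each watermark word arrives as its own fragment.

```csharp
using System;
using System.Collections.Generic;

public static class WatermarkFilter
{
    // The three watermark words, compared without regard to case.
    private static readonly HashSet<string> WatermarkWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "copy", "of", "original" };

    // Substring check: true whenever one of the words occurs anywhere in the text.
    public static bool ContainsWatermarkWord(string text) =>
        text.IndexOf("copy", StringComparison.OrdinalIgnoreCase) >= 0 ||
        text.IndexOf("original", StringComparison.OrdinalIgnoreCase) >= 0 ||
        text.IndexOf("of", StringComparison.OrdinalIgnoreCase) >= 0;

    // Exact check: true only when the trimmed text is exactly one of the words.
    public static bool IsWatermarkWord(string text) =>
        WatermarkWords.Contains(text.Trim());
}
```

With these helpers, `ContainsWatermarkWord("Office of the Lender")` is true because of the embedded "of", while `IsWatermarkWord("Office of the Lender")` is false and `IsWatermarkWord(" COPY ")` is true. The exact check would still blank a legitimate fragment that consists only of the word "of", so it is a narrower filter rather than a safe one.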

My main problem now is that the text gets moved when the document is saved. If COPY is on a line with other text and “COPY” gets replaced with “”, the text after COPY on that line shifts left by the distance that COPY used to cover.

            // string folder = @"C:\Temp";
            // string search = "*.pdf";
            // List<string> files = Directory.GetFiles(...).ToList();
            // string pdf = ".pdf";
            // string ender = "-wow.pdf";
            foreach (var file in files)
            {
                var fi = new FileInfo(file);
                var fiw = new FileInfo(file.Replace(pdf, ender)); // change .pdf to -wow.pdf
                if (fiw.Exists)
                    continue;

                var dir = fi.Directory.FullName;
                
                var startEnd = ".+";
                var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
                var textSearchOptions = new TextSearchOptions(true);
                textFragmentAbsorber.TextSearchOptions = textSearchOptions;
                
                _logger.Info($"          New Document...");
                Document pdfDocument = new Document(fi.FullName);

                _logger.Info($"          Accept Absorber...");
                pdfDocument.Pages.Accept(textFragmentAbsorber);
                var textFragmentCollection = textFragmentAbsorber.TextFragments;
                for (var index = 1; index <= textFragmentCollection.Count; index++)
                {
                    TextFragment textFragment = textFragmentCollection[index];
                    var textFragmentText = textFragment.Text.Trim().ToLower();
                    switch (textFragmentText)
                    {
                        case "copy":
                        case "of":
                        case "original":
                            textFragment.Text = string.Empty;
                            break;
                        default:
                            break;
                    }
                }

                _logger.Info($"          Save WOW pdf...");
                pdfDocument.Save(fiw.FullName);
            }

Loan Document - 00aaa7c0-62b6-4587-b2ec-849a942cfe4e-DEC-wow0_010.pdf (513.8 KB)
Loan Document - 00aaa7c0-62b6-4587-b2ec-849a942cfe4e-DEC-wow4_010.pdf (513.5 KB)

“Without Out Watermark 0” (the wow0 file), page 10, is the unaltered document.

“Without Out Watermark 4” (the wow4 file), page 10, is the altered document.

	(b) Fictitious Business Name. Borrower has filed or recorded all documents or filings required by law relating to all fictitious
	business names used by Borrower. The fictitious business names previously disclosed in , witrsi trienggi sttoe rLeedn sduecrcessors
	and assigns or any authorized agent thereof constitute a cofmicptilteioteu sl isbtu osifn aelsls names under which Borrower does

Notice how the "textFragment"s are moved left? That’s where COPY would be.

@wernercd

Thanks for getting back to us.

We have tested the scenario in our environment and were able to reproduce the issue. We have logged it as PDFNET-45897 in our issue tracking system. We will look further into the details of the issue and keep you posted on the status of its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.

No problem at all, I appreciate the timely responses.