Remove line numbers from PDF file using Aspose.PDF for .NET

Hello,

Suppose we have a PDF generated from an editable source (Word, LaTeX, …) that includes line numbers. Is it possible to remove the line numbers from that PDF document using Aspose.PDF for .NET?

Thanks

Best regards,

@stefano.giannone.frontiers

Thanks for contacting support.

To test the scenario and provide feedback, we need a sample PDF document. Would you please share your PDF document with us? This would help us understand your requirements and assist you accordingly.

Latex_Sample.pdf (89.4 KB)
Word_Sample.pdf (127.2 KB)

Here are the two sample PDFs with line numbers. One was generated from a Word file, and the other from a LaTeX file.

We need to strip out the line numbers from both, but we don’t have the source files. So, if possible, we would like to do that starting from the resulting PDF file itself.

Thanks

@stefano.giannone.frontiers

Thanks for sharing sample PDF documents.

We tried to meet your requirements using the following code snippet, but without much success, as the output PDF file had formatting issues. For your reference, here is the code we tried, along with the generated output:

var startEnd = ".+";
var textFragmentAbsorber = new TextFragmentAbsorber(startEnd);
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
Document pdfDocument = new Document(dataDir + "Word_Sample.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber);

var textFragmentCollection = textFragmentAbsorber.TextFragments;
var count = textFragmentCollection.Count;
foreach (TextFragment textFragment in textFragmentCollection)
{
    if (textFragment.Text.Trim() != "")
    {
        string lineNumber = textFragment.Text.TrimEnd().Substring(textFragment.Text.TrimEnd().Length - 1);
        if (int.TryParse(lineNumber, out int n))
        {
            textFragment.Text = textFragment.Text.TrimEnd().Replace(lineNumber, "");
        }
    }
}
pdfDocument.Save(dataDir + "test18.12.out.pdf");

test18.12.out.pdf (194.6 KB)
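
A side note on the string handling in the snippet above, independent of any PDF formatting issues: `Substring(Length - 1)` picks up only the last character of the fragment, and `Replace` then removes every occurrence of that character, not just the trailing one. The sketch below reproduces this with plain strings and no Aspose types. The class and method names are hypothetical, and `StripTrailingDigits` is just one possible alternative, not something taken from the thread.

```csharp
using System;

public static class TrailingDigitDemo
{
    // Mirrors the string logic of the snippet above on a plain string.
    public static string OriginalApproach(string text)
    {
        string trimmed = text.TrimEnd();
        if (trimmed.Length == 0) return text;

        // Only the LAST character is inspected...
        string last = trimmed.Substring(trimmed.Length - 1);

        // ...and Replace removes EVERY occurrence of it in the line.
        return int.TryParse(last, out _) ? trimmed.Replace(last, "") : text;
    }

    // Hypothetical alternative: remove only the trailing run of digits.
    public static string StripTrailingDigits(string text)
    {
        string trimmed = text.TrimEnd();
        int end = trimmed.Length;
        while (end > 0 && char.IsDigit(trimmed[end - 1])) end--;
        return trimmed.Substring(0, end).TrimEnd();
    }

    public static void Main()
    {
        string line = "Table 2 lists 12 samples 12";
        Console.WriteLine(OriginalApproach(line));    // Table  lists 1 samples 1
        Console.WriteLine(StripTrailingDigits(line)); // Table 2 lists 12 samples
    }
}
```

With the original logic, every "2" in the line disappears, including digits inside the body text, which may explain part of the damaged output.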

We have logged an investigation ticket as PDFNET-45791 in our issue tracking system to determine whether your requirements can be achieved. We will keep you posted on the ticket's resolution as soon as we have updates. Please allow us a little time.

We are sorry for the inconvenience.

That’s great. Thanks for your effort.

I’ll wait for your feedback.

Best regards,
Stefano

@stefano.giannone.frontiers

Thanks for your feedback.

We will surely let you know as soon as some additional updates are available.

Hello,

Is there any update on this?

We’re waiting for this to decide whether we should proceed with buying an Aspose.PDF license.

Thanks for your effort.

Best regards,
Stefano

@stefano.giannone.frontiers

Thanks for your inquiry.

I am afraid that the investigation ticket logged earlier is not yet resolved, because previously logged issues are still pending in the queue. However, we have recorded your concerns and will definitely consider them during the investigation. As soon as definite updates are available, we will let you know. Please allow us a little time.

We are sorry for the inconvenience.

Hello,

Thanks for the time you’re dedicating to this.

In the meantime, we wanted to try finding a solution ourselves, based on the sample code you provided.

However, we cannot, because we receive an exception saying ‘At most 4 elements (for any collection) can be viewed in evaluation mode.’ (see attached screenshot).

Capture.PNG (18.9 KB)

Could you provide us with a way to run the code without this restriction? Perhaps you could provide a temporary license key to use just for this investigation. Once we arrive at a solution, we can buy a proper license.

Also, just to mention, we don’t mind much if the resulting PDF loses some formatting. We care more about removing the line numbers from the text of the PDF.

Thanks again for your support.

Best regards,
Stefano

@stefano.giannone.frontiers

Thanks for getting back to us.

Please consider applying a 30-day temporary license, which lets you use the API features without any restriction.

@stefano.giannone.frontiers

Thanks for your patience.

We have further investigated the ticket logged earlier and found that your requirement can be met using the following code snippet:

Remove Line Numbers from PDF document

string pattern = @"(?<=\s)\d+(?=\s*$)";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(pattern);
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

Document pdfDocument = new Document(myDir + "Latex_Sample.pdf");
pdfDocument.Pages.Accept(textFragmentAbsorber);

int number = 0;
int prevNumber = 0;

foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    // Check to skip page numbers
    if (int.TryParse(textFragment.Text, out number) && number > prevNumber)
    {
        textFragment.Text = "";
        prevNumber = number;
    }                    
}

pdfDocument.Save(myDir + "Latex_Sample_out.pdf"); 

The above code snippet has been tested with both of your source documents, and it worked fine. If you need any further assistance, please feel free to let us know.
Latex_Sample_out.pdf (89.4 KB)
Word_Sample_out.pdf (126.0 KB)
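
To make the logic of the snippet above easier to follow, here is a minimal sketch that applies the same pattern and the same `prevNumber` check to plain strings using .NET's `Regex` class. It assumes each text line is matched as a separate string; how Aspose.PDF applies the pattern to page text internally is not shown in this thread. The class name and sample lines are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineNumberFilterDemo
{
    // Same pattern as above: digits preceded by whitespace and followed
    // only by optional whitespace up to the end of the string.
    private static readonly Regex TrailingNumber = new Regex(@"(?<=\s)\d+(?=\s*$)");

    // Removes a trailing number from each line, but only when it is greater
    // than the last number removed (the prevNumber check from the snippet).
    public static List<string> Strip(IEnumerable<string> lines)
    {
        var result = new List<string>();
        int prevNumber = 0;
        foreach (string line in lines)
        {
            Match m = TrailingNumber.Match(line);
            if (m.Success && int.TryParse(m.Value, out int number) && number > prevNumber)
            {
                prevNumber = number;
                result.Add(line.Remove(m.Index, m.Length).TrimEnd());
            }
            else
            {
                result.Add(line); // not an increasing trailing number: keep as-is
            }
        }
        return result;
    }

    public static void Main()
    {
        var lines = new[]
        {
            "Introduction 1",
            "Samples were collected in 2019 2",
            "See page 1",
            "as described above 3"
        };
        foreach (string s in Strip(lines)) Console.WriteLine(s);
    }
}
```

In this sample, "2019" survives because it is not at the end of its line, and "See page 1" is left untouched because 1 is not greater than the previously removed number 2. That second case is the kind of false match the increasing-number check is meant to skip.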

Hello,

We incorporated the code you provided. It works well, but the loop through the text fragments seems to be a bit slow (especially for large files).

Suppose we don’t want to do any check (whether the text is a number, or whether the number is greater than the previous one). Is there a faster way to replace all the regex matches (fragment texts) with an empty string, without iterating through the fragments collection?

We need to make it as fast as possible.

Thanks

@stefano.giannone.frontiers

To remove the text of all found text fragments, you can simply use the textFragmentAbsorber.RemoveAllText() method. This way you would not need to iterate through all the text fragments in the TextFragmentCollection. However, please note that the check for whether a number is greater than the previous one was added on purpose, so that no other number (one that is not a line number) is removed.

Furthermore, instead of using pdfDocument.Pages.Accept(textFragmentAbsorber);, you can absorb text page by page, which will also help reduce the time cost. If you still face any issue, please feel free to let us know.

I tried to use textFragmentAbsorber.RemoveAllText(document), but it removes all the text, resulting in a blank page. This is not what I was expecting. It seems that it does not take the regular expression into consideration.

After further analysis, we discovered that the slowest piece of code is the Text property’s setter, when we do:

textFragment.Text = "";

Indeed, I can see you’re doing a lot of things inside the setter, but I cannot follow it very well, as the code is obfuscated when I try to decompile it.

If you give me your email address, I can share a PDF document for which the code you provided takes about 26 minutes to run on my machine. This is not acceptable, as the PDF is not that big.

So maybe you can try to run the code against this PDF on your side and check the performance of the Text property’s setter.

Thanks a lot.

Best regards,
Stefano

@stefano.giannone.frontiers

Thanks for sharing these details.

Please send your file in a private message. You can send a private message by clicking on the username and pressing the blue ‘Message’ button.