Read each line in a PDF, check whether the line matches a regular expression, and update the matching text

afsal.akbarsha · December 27, 2018, 10:45am

Hi,
Requirement: I need to read each line in a PDF and check whether the line matches a regular expression.
If the **regular expression matches, then update the text**.

Please check the code below. I am able to find the text using the regular expression, but
I am unable to remove the text and save the result in the PDF.

    // Extract the text of all pages in "Pure" formatting mode
    var extractOption = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
    var textAbsorber = new TextAbsorber(extractOption);
    pdfDocument.Pages.Accept(textAbsorber);

    var extractedtext = textAbsorber.Text;

    // Split the extracted text into lines and test each one
    var pdfTextContents = extractedtext.Split('\n');
    var _pattern = @"^[0-9].*$";
    var lineNumber = 0;
    var prevLineNumber = 0;

    foreach (var content in pdfTextContents)
    {
        var result = Regex.Match(content, _pattern);

        if (result.Success)
        {
            // REMOVE THE TEXT FROM PDF
        }
    }
    // SAVE THE UPDATED PDF

Farhan.Raza · December 27, 2018, 7:22pm

@afsal.akbarsha

Thank you for contacting support.

Would you please share the source PDF document with us, along with the code snippet showing how you are removing the text from the PDF file? Please also make sure the regular expression is working as expected, using any online regex testing utility. Moreover, you may visit "Search and Get Text from all pages using Regular Expression" for your kind reference.

afsal.akbarsha · December 28, 2018, 4:47am

Remove_line_Properly_Test_doc.pdf (382.0 KB)

Hi,
@Farhan.Raza
Thanks for the suggestion,
but I think "Search and Get Text from all pages using Regular Expression"
will not work for my scenario, because we have some validation rules to check before removing line numbers.

My original requirement is to check whether the PDF has line numbers or not.
If line numbers are present in the PDF, then I need to remove them.
Please check the attached sample document.

Technically, I first need to validate the PDF document
to make sure it has valid line numbers.

The validation rules are:
- Line numbers must be in ascending order.
- The line number is the first element of each line.
- Header and footer sections have no line numbers.
- All pages have line numbers.
- Line numbers are not separate for each page (the numbering continues across pages).

If the above validation rules succeed, then I need to remove the line numbers.
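
The rules above could be checked on already-extracted text roughly as follows. This is only a minimal sketch in plain C#, not Aspose code: the class and method names are hypothetical, it assumes the text has already been split into lines per page, it treats every non-blank line without a leading number as header or footer, and it reads "ascending" as strictly increasing rather than exactly +1. A footer that consists of a bare page number would be mistaken for a line number, so real documents may need an extra check.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF.
public static class LineNumberValidator
{
    // Optional leading whitespace, then the digits that form the line number.
    private static readonly Regex LeadingNumber = new Regex(@"^\s*([0-9]+)");

    // pages: one entry per page, each entry holding that page's text lines (assumed input shape).
    public static bool HasValidLineNumbers(IEnumerable<IEnumerable<string>> pages)
    {
        int previous = 0;
        foreach (var page in pages)
        {
            bool pageHasNumber = false;
            foreach (var line in page)
            {
                if (string.IsNullOrWhiteSpace(line)) continue;

                Match match = LeadingNumber.Match(line);
                // Assumption: an unnumbered line is header or footer text, so skip it.
                if (!match.Success) continue;

                int current;
                // A digit run too large for int fails validation.
                if (!int.TryParse(match.Groups[1].Value, out current)) return false;

                // Numbers must keep increasing, including across page boundaries.
                if (current <= previous) return false;

                previous = current;
                pageHasNumber = true;
            }
            // Every page must contain at least one numbered line.
            if (!pageHasNumber) return false;
        }
        // At least one line number must have been seen overall.
        return previous > 0;
    }
}
```

Because `previous` is never reset between pages, numbering that restarts at 1 on a later page fails the check, which covers the last rule in the list.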

Here I am taking each line and
checking whether its line number is valid.
If it is valid,
then I need to remove the line number
and save the PDF.

Please check the code below.

    private static bool IsValidThenRemoveLineNumbers(Document pdfDocument)
    {
        var extractOption = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
        var textAbsorber = new TextAbsorber(extractOption);
        // Aspose.PDF page collections are 1-based, so the first page is Pages[1]
        pdfDocument.Pages[1].Accept(textAbsorber);

        var extractedtext = textAbsorber.Text;
        var pdfTextContents = extractedtext.Split('\n');
        var _pattern = @"^[ ]*\d+";
        var lineNumber = 0;
        var prevLineNumber = 0;
        var isValid = false;

        foreach (var content in pdfTextContents)
        {
            // Ignore lines that are null, empty, or white space only
            if (string.IsNullOrWhiteSpace(content))
            {
                continue;
            }

            // Check whether a line number exists at the start of the line
            var result = Regex.Match(content, _pattern);

            // Unnumbered lines are tolerated only before the first line number (header);
            // once numbering has started, an unnumbered line fails validation
            if (!result.Success && lineNumber > 0)
            {
                return false;
            }

            // Remove the line number if it is valid
            if (result.Success)
            {
                // ConvertToInt is my own helper (not shown here)
                if (ConvertToInt(out lineNumber, result.Value))
                {
                    // Return invalid if the line number does not follow the previous one
                    if (lineNumber == prevLineNumber + 1)
                    {
                        isValid = true;
                        prevLineNumber = lineNumber;
                        // Require a solution to remove the line number
                    }
                    else
                    {
                        return false;
                    }
                }
            }
        }
        return isValid;
    }

If valid, then save the PDF.

**After validating the line**,
I need to remove the line number, but I didn't find a way to remove the text.
Please suggest a solution for removing the line numbers.

Regards

Farhan.Raza · December 28, 2018, 3:39pm

@afsal.akbarsha

We have devised a demonstration of how text or line numbers can be removed from a PDF document. The generated PDF file, Test_18.12.pdf, has been attached for your kind reference. Please try using the code snippet below in your environment:

Document pdfDocument = new Document(dataDir + "Remove_line_Properly_Test_doc.pdf");
string pattern = @"(?<=\s)\d+(?=\s*$)";
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(pattern);
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;            
pdfDocument.Pages.Accept(textFragmentAbsorber);
int number = 0;
int prevNumber = 0;
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    // Check to skip page numbers
    if (int.TryParse(textFragment.Text, out number)) //  && number > prevNumber
    {
        textFragment.Text = "";
        //prevNumber = number;
    }
}
pdfDocument.Save(dataDir + "Remove_line_Properly_Test_18.12.pdf");

However, the line numbers in the source file are not in order, as 49 appears after 51. If your line numbers are in sequence, you may use the commented-out code to get more precise results, or you may make any modifications as per your requirements.
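
To illustrate what the commented-out `number > prevNumber` condition would do, here is a minimal sketch in plain C# that works on the candidate strings alone. The `SequenceFilter` class and `IndexesToBlank` method are hypothetical names, not part of Aspose.PDF, and the input is assumed to be the matched fragment texts in document order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.PDF.
public static class SequenceFilter
{
    // Returns the indexes of candidates that parse as integers and are
    // greater than the last accepted number (the "number > prevNumber" check).
    public static List<int> IndexesToBlank(IList<string> candidates)
    {
        var indexes = new List<int>();
        int prevNumber = 0;
        for (int i = 0; i < candidates.Count; i++)
        {
            int number;
            if (int.TryParse(candidates[i], out number) && number > prevNumber)
            {
                indexes.Add(i);
                prevNumber = number;
            }
        }
        return indexes;
    }
}
```

With the candidates `47, 48, 50, 51, 49, 52`, the out-of-order `49` is skipped and therefore would not be removed, which is why the sequence check only suits documents whose numbers are strictly in order. Note also that one stray large number early in the document would block every smaller number after it.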

afsal.akbarsha · January 2, 2019, 12:47pm

Hi,
@Farhan.Raza
Thanks for the suggestion.

I would like to know how to identify LaTeX-type equations in a PDF document
and replace them with some other text.
Please check the attached sample PDF.
Manuscript_new_0010.pdf (391.7 KB)

Farhan.Raza · January 2, 2019, 8:02pm

@afsal.akbarsha

We are afraid LaTeX text may not be distinguishable from other text in a PDF document. However, a ticket with ID PDFNET-45872 has been logged in our issue management system for further investigation and resolution. We will investigate whether LaTeX fragments can be extracted and replaced as per your requirements. Moreover, you may try devising a regular expression, if that works for your scenario.