PDF search using regular expressions

Hi Team,

I need help searching for text in a PDF. I am looking for regular expressions that cover the following scenarios. Could you please send me the regular expression for each scenario?

Search Keyword1 : "the"
Search Keyword2 : "axe"

Scenario 1 :
All the words starting with “the” should be returned without checking for case

Scenario 2:
All the words starting with “the” should be returned without checking for case

Scenario 3:
All the words containing the word “the” should be returned without checking for case


Scenario 4:
All the words starting with the word “the” and ending with “axe” should be returned without checking for case

Hi Navaneethan,

Thanks for contacting support.

To accomplish the above requirement, please try the following code snippet.

[C#]

// namespaces needed by this snippet (Aspose.PDF for .NET)
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// open document
Document pdfDocument = new Document("c:/pdftest/draftAuthorization.pdf");

// create TextFragmentAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i) the", new TextSearchOptions(true));

// set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// loop through the fragments and print the text and position of each match
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
    Console.WriteLine("Position : {0} ", textFragment.Position);
}


Regarding the other two scenarios, we are working on the required code snippet and will get back to you soon.

Hi Navaneethan,

Thanks for your patience.

We have looked further into your requirements. In addition to the details shared earlier, please avoid using the expression @"(?i) the" to find all the words starting with “the”. This approach is not appropriate because it will not find “The” at the beginning of a line, and because it returns a space followed by “the” rather than a word starting with “the”. See: Expression_0.png.
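The difference can be checked without a PDF by running both expressions against a plain string with System.Text.RegularExpressions. This is only a minimal sketch: the sample sentence, the class name LeadingSpaceDemo and the helper Matches are made up for illustration, and it assumes the PDF text search treats the expression the same way the standard .NET regex engine does.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LeadingSpaceDemo
{
    // Hypothetical helper: returns every match of pattern in text,
    // each wrapped in [] so that a leading space is visible.
    public static string[] Matches(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => "[" + m.Value + "]")
                    .ToArray();
    }

    public static void Main()
    {
        // Made-up sample line; "The" sits at the very beginning.
        string text = "The axe is in the shed with these tools";

        // Leading-space expression: misses the first "The" and keeps the space.
        Console.WriteLine(string.Join(" ", Matches(text, @"(?i) the")));
        // prints: [ the] [ the]

        // Word-boundary expression: whole words starting with "the".
        Console.WriteLine(string.Join(" ", Matches(text, @"(?i)\bthe\w*\b")));
        // prints: [The] [the] [these]
    }
}
```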

Please use:

  1. @"(?i)\bXXX\w*\b" for all words starting with “XXX” (case insensitive). Scenario #1 (#2). See: Expression_1_2.png
  2. @"(?i)\b\w*XXX\w*\b" for all words containing the “XXX” substring. See: Expression_3.png
  3. @"(?i)\bXXX\w*YYY\b" for all words starting with “XXX” and ending with “YYY”. See: Expression_4.png

Explanation:

  • (?i) - turns on case-insensitive matching for the rest of the expression.
  • \b - matches a zero-width boundary between a word-class character and either a non-word-class character or an edge (start or end of the text).
  • \w - matches a word-class character.
  • * - matches the preceding pattern element zero or more times.
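As a rough illustration of how the three expressions behave once XXX and YYY are replaced by real keywords, the sketch below builds them for “the” and “axe” and runs them over a made-up sentence with the standard .NET Regex class. The class WordPatterns and its methods are hypothetical names, not part of Aspose.PDF; Regex.Escape is used only so that a keyword containing special characters is treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordPatterns
{
    // Words starting with xxx (scenarios 1 and 2).
    public static string StartsWith(string xxx) =>
        @"(?i)\b" + Regex.Escape(xxx) + @"\w*\b";

    // Words containing xxx anywhere (scenario 3).
    public static string Contains(string xxx) =>
        @"(?i)\b\w*" + Regex.Escape(xxx) + @"\w*\b";

    // Words starting with xxx and ending with yyy (scenario 4).
    public static string StartsAndEndsWith(string xxx, string yyy) =>
        @"(?i)\b" + Regex.Escape(xxx) + @"\w*" + Regex.Escape(yyy) + @"\b";

    // Returns the text of every match of pattern in text.
    public static string[] Find(string text, string pattern) =>
        Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Made-up sample text for illustration only.
        string text = "The father and mother bathe; Theaxe, thelongaxe and theaxes.";

        Console.WriteLine(string.Join(", ", Find(text, StartsWith("the"))));
        // prints: The, Theaxe, thelongaxe, theaxes

        Console.WriteLine(string.Join(", ", Find(text, Contains("the"))));
        // prints: The, father, mother, bathe, Theaxe, thelongaxe, theaxes

        Console.WriteLine(string.Join(", ", Find(text, StartsAndEndsWith("the", "axe"))));
        // prints: Theaxe, thelongaxe
    }
}
```

The same pattern strings can then be passed to the TextFragmentAbsorber constructor shown earlier in place of @"(?i) the".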

For more information, please see the Wikipedia article on regular expressions and https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx.