Hi Team,
I need help searching for text in a PDF. I am looking for regular expressions that cover the following scenarios. Can you please send me the regular expressions for them?
Search Keyword1: "the"
Search Keyword2: "axe"
Scenario 1: All words starting with "the" should be returned, ignoring case.
Scenario 2: All words starting with "the" should be returned, ignoring case.
Hi Navaneethan,
//open document
Document pdfDocument = new Document("c:/pdftest/draftAuthorization.pdf");
//create TextFragmentAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i) the", new TextSearchOptions(true));
//set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
    Console.WriteLine("Position : {0} ", textFragment.Position);
}
Hi Navaneethan,
Thanks for your patience.
We have looked further into your requirements. In addition to the details shared earlier, please follow the steps below. Please avoid using the expression @"(?i) the" to find all words starting with "the". It is not appropriate because it will not find "The" at the beginning of a line, and it returns a space followed by "the" rather than a word starting with "the". See: Expression_0.png.
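To make the problem with @"(?i) the" concrete, here is a minimal sketch that uses plain System.Text.RegularExpressions instead of Aspose.Pdf. It assumes the absorber's regular expression mode behaves like the .NET regex engine; the class name, the Find helper and the sample sentence are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpaceTheDemo
{
    // Returns every match of the pattern, wrapped in brackets so that
    // a leading space inside the match is visible in the output.
    public static string[] Find(string pattern, string text)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => "[" + m.Value + "]")
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample text: "The" opens the line, "the" and "theatre" follow a space.
        string sample = "The axe is in the theatre.";

        // The leading "The" is missed, and both hits include the space.
        Console.WriteLine(string.Join(" ", Find(@"(?i) the", sample)));
        // prints: [ the] [ the]
    }
}
```

Note that the second hit is only the first four characters of " theatre", so the expression returns neither whole words nor the word at the start of the line.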
Please use:
- @"(?i)\bXXX\w*\b" for all words starting with "XXX" (case insensitive). Scenario #1 (#2). See: Expression_1_2.png
- @"(?i)\b\w*XXX\w*\b" for all words containing the "XXX" substring. See: Expression_3.png
- @"(?i)\bXXX\w*YYY\b" for all words starting with "XXX" and ending with "YYY". See: Expression_4.png
Explanation:
- \b - matches a zero-width boundary between a word-class character and either a non-word class character or an edge.
- \w - matches a word-class character.
- * - matches the preceding pattern element zero or more times.
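The three expressions can be tried out without a PDF. The sketch below again uses plain System.Text.RegularExpressions and assumes Aspose.Pdf evaluates the pattern the same way as the .NET regex engine. The WordPatterns class, its method names and the sample sentence are hypothetical; Regex.Escape is added only so that a keyword containing characters such as "." or "+" is treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordPatterns
{
    // Words starting with the keyword, case insensitive.
    public static string StartsWith(string keyword)
    {
        return @"(?i)\b" + Regex.Escape(keyword) + @"\w*\b";
    }

    // Words containing the keyword anywhere, case insensitive.
    public static string Contains(string keyword)
    {
        return @"(?i)\b\w*" + Regex.Escape(keyword) + @"\w*\b";
    }

    // Words starting with one keyword and ending with another.
    public static string StartsAndEndsWith(string start, string end)
    {
        return @"(?i)\b" + Regex.Escape(start) + @"\w*" + Regex.Escape(end) + @"\b";
    }

    // Collects the matched words in document order.
    public static string[] Words(string pattern, string text)
    {
        return Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string sample = "The theatre taxes other axes; Thelma took the axe.";

        Console.WriteLine(string.Join(", ", Words(StartsWith("the"), sample)));
        // prints: The, theatre, Thelma, the
        Console.WriteLine(string.Join(", ", Words(StartsWith("axe"), sample)));
        // prints: axes, axe
        Console.WriteLine(string.Join(", ", Words(Contains("axe"), sample)));
        // prints: taxes, axes, axe
        Console.WriteLine(string.Join(", ", Words(StartsAndEndsWith("the", "re"), sample)));
        // prints: theatre
    }
}
```

"other" contains "the" but is not returned by the starts-with pattern, because \b does not match between "o" and "t". A string built by one of these methods can be passed as the first argument of TextFragmentAbsorber together with new TextSearchOptions(true), as in the code shared earlier.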
For more information, please see the Wikipedia article "Regular expression" and the MSDN reference: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx