Search text in PDF document

PriyankaShelke · December 3, 2018, 4:59pm

Hi,

I have to search for a piece of text in a document.
I use the following code:
var tfa = new TextFragmentAbsorber(parameterTitle.Text + " "); // parameterTitle is a TextFragment
tfa.TextSearchOptions = new TextSearchOptions(true);
page.Accept(tfa);
if (tfa.TextFragments.Count > 0)
{
    // do the logic
}

I have a huge document, more than 100 pages.
Sometimes the text is found on a page and sometimes it is not. I loop over the pages from page 1 to the last page of the document.
The text to be searched is "P3", but the search also returns "P31" and "P33", which is not what I expect.
Is there any way I can get only the exact text that I am searching for?
I tried the regex string "^" + "P3" + "$", but then "P3" was not found either. I am not clear on how the regex works.

Farhan.Raza · December 3, 2018, 10:17pm

@PriyankaShelke

Thank you for contacting support.

Would you please share the source PDF document with us, along with the value of parameterTitle, and specify the sample text and the page number you want to search?

Moreover, you may design and test a regular expression in an online utility and then use the verified expression with the Aspose.PDF for .NET API as per your requirements. Below is a sample code snippet using a regular expression.

// open document
Document pdfDocument = new Document(myDir + @"Test.pdf");
// create TextFragmentAbsorber object to find all matches of the regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"\d{4}(\r\n)?-(\r\n)?\d{4}"); //like 1999-2000
// set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
Console.WriteLine(textFragmentCollection.Count);
// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    // update text and other properties
    textFragment.Text = "xxxx-xxxx";
}
pdfDocument.Save(myDir + @"Test_out.pdf");

You may also refer to the documentation article "Search and Get Text from all pages using Regular Expression".
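
On the "P3" versus "P31" question, here is a minimal sketch of how a word-boundary pattern behaves. It uses only the standard .NET Regex class on a made-up sample string, so it illustrates the regular expression itself and not how Aspose.PDF applies it internally. The class name WordBoundaryDemo and the helper WholeWordPattern are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

class WordBoundaryDemo
{
    // Builds a pattern that matches 'term' only as a whole word.
    // Regex.Escape guards against special characters in the term.
    public static string WholeWordPattern(string term) =>
        @"\b" + Regex.Escape(term) + @"\b";

    static void Main()
    {
        // Made-up sample text standing in for the text of a page.
        string pageText = "See P3, P31 and P33 for details.";
        string pattern = WholeWordPattern("P3");

        // Only the standalone "P3" is reported; "P31" and "P33" are skipped
        // because no word boundary follows the "3" in them.
        foreach (Match m in Regex.Matches(pageText, pattern))
            Console.WriteLine($"'{m.Value}' at index {m.Index}");

        // ^P3$ requires the whole input to be exactly "P3",
        // so it does not match text embedded in a longer string.
        Console.WriteLine(Regex.IsMatch(pageText, "^P3$"));
    }
}
```

If this pattern suits your case, it could be passed to the TextFragmentAbsorber constructor with TextSearchOptions(true), in the same way as the pattern in the snippet above. Whether the result is exact on your document depends on how the text is stored in the PDF, so please verify it against your file.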

Jackson94 · January 6, 2022, 2:42am

G’day
I have tried this. Whilst it works with bolded text that is a heading, it fails to find text within a paragraph.
Can you please shed some light on how to find text within a paragraph?
Thanks

asad.ali · February 9, 2022, 6:09pm

@Jackson94

Have you tried changing the regular expression in the code snippet given above? The search may be limited by the regular expression used. If you want to search the complete text, you can try initializing the TextFragmentAbsorber with the empty constructor. If the issue still persists, please share your sample source PDF file with us along with some details of your actual requirements. We will test the scenario in our environment and address it accordingly.
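
A minimal sketch of the empty-constructor approach, assuming you want to collect every fragment first and then filter in your own code. The file name "input.pdf", the search term "P3", and the helper ContainsWholeWord are placeholders for illustration and are not part of the Aspose.PDF API.

```csharp
using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class FindInAllFragments
{
    // Hypothetical helper: true when 'term' occurs in 'text' as a whole word.
    public static bool ContainsWholeWord(string text, string term) =>
        Regex.IsMatch(text, @"\b" + Regex.Escape(term) + @"\b");

    static void Main()
    {
        // "input.pdf" is a placeholder path.
        Document pdfDocument = new Document("input.pdf");

        // No search phrase: the absorber collects all text fragments.
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        pdfDocument.Pages.Accept(absorber);

        // Filter the collected fragments in application code.
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            if (ContainsWholeWord(fragment.Text, "P3"))
                Console.WriteLine($"Page {fragment.Page.Number}: {fragment.Text}");
        }
    }
}
```

With this approach each fragment can be a longer run of text, such as a line inside a paragraph, so the filter looks for the term within the fragment instead of expecting the fragment to equal the term.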