How to search multiple keywords in a PDF

Jackson94 · February 23, 2022, 2:25am

Is there a way to search for multiple keywords in a PDF or other document?
For example:
Paris AND Jackson AND Nepal AND (Trophy OR Award)
The query above should match a document that contains Paris, Jackson, and Nepal, plus either Trophy or Award, and the search should not be case-sensitive.

Thanks

tahir.manzoor · February 23, 2022, 11:00am

@Jackson94

Yes, you can search for multiple keywords in a PDF using Aspose.PDF. Please read the following article about searching text in a PDF.
Search and Get Text from Pages of PDF

Jackson94 · February 23, 2022, 10:40pm

I visited the link but was unable to find a matching scenario.
Is there a specific example you could share that achieves this outcome?

tahir.manzoor · February 24, 2022, 6:40am

@Jackson94

You can use a regular expression to match text that spans multiple lines. In the pattern, “\s*” matches the spaces and line breaks that Aspose.PDF finds between words. Please check the following code snippet, which extracts a particular phrase from the PDF:

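using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the input file.
Document pdfDocument = new Document(dataDir + "sample.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // \s* lets the phrase match across spaces and line breaks; literal dots are escaped as \.
    var textFragmentAbsorber = new TextFragmentAbsorber(@"just\s*for\s*use\s*in\s*the\s*Virtual\s*Mechanics\s*tutorials\.\s*More\s*text\.\s*And\s*more\s*text\b");
    // Passing true treats the search phrase as a regular expression.
    var textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    page.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    // Perform other stuff
}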
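
Writing “\s*” between every word by hand is easy to get wrong, so the pattern can also be generated from the plain phrase. The following is only a minimal sketch: PhrasePattern and ToPattern are hypothetical names used for illustration and are not part of Aspose.PDF. It relies only on System.Text.RegularExpressions, escapes each word so that characters such as “.” are matched literally, and joins the words with “\s*”.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhrasePattern
{
    // Hypothetical helper (not an Aspose.PDF API): turns a plain phrase into a
    // regex in which the words may be separated by any amount of whitespace,
    // including line breaks, or by none at all.
    public static string ToPattern(string phrase)
    {
        // Passing a null separator array splits on any whitespace character.
        string[] words = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Escape each word so regex metacharacters (for example ".") match literally.
        return string.Join(@"\s*", words.Select(w => Regex.Escape(w)));
    }
}
```

For example, PhrasePattern.ToPattern("More text. And more text") returns More\s*text\.\s*And\s*more\s*text, which can then be passed to TextFragmentAbsorber together with TextSearchOptions(true) as in the snippet above.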

Jackson94 · February 28, 2022, 7:28am

Multi-line text implies a complete sentence whose words are separated by \s*.

My case is different. I want to search for individual words, for example “text” and “sample”, and flag the document only when both of them appear somewhere in it.
The two words do not have to be next to each other; they only need to appear in the same document.
Can that be achieved using \s*?

Please confirm.

tahir.manzoor · February 28, 2022, 4:37pm

@Jackson94

You can use the TextFragmentAbsorber(Regex) constructor with a regex that fits your requirement. This lets you find multiple keywords in the PDF. The following code example shows how to use it.

var regex = @"... regex for multiple keywords ";

// dataDir is the folder that contains the input file.
Document pdfDocument = new Document(dataDir + "input.pdf");

// Pages are 1-indexed, so this searches the first page only.
Page page = pdfDocument.Pages[1];
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regex);
// Passing true treats the search phrase as a regular expression.
var textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
page.Accept(textFragmentAbsorber);
var textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (var textFragment in textFragmentCollection)
{
    Console.WriteLine(textFragment.Text);
}
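
The regex in the snippet above is left as a placeholder. For a query such as Paris AND Jackson AND Nepal AND (Trophy OR Award), one possible approach is to skip the single combined regex and check each keyword separately against the extracted text. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the document text has already been extracted into a string (for example with Aspose.PDF's TextAbsorber, by calling pdfDocument.Pages.Accept(absorber) and reading absorber.Text), and KeywordQuery, ContainsWord, and Matches are hypothetical names that are not part of Aspose.PDF.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordQuery
{
    // True when 'word' occurs in 'text' as a whole word, ignoring case.
    public static bool ContainsWord(string text, string word) =>
        Regex.IsMatch(text, @"\b" + Regex.Escape(word) + @"\b", RegexOptions.IgnoreCase);

    // allOf: every keyword must be present (the AND part of the query).
    // anyOf: at least one keyword must be present (the OR part of the query).
    // Note: an empty anyOf array makes the result false.
    public static bool Matches(string text, string[] allOf, string[] anyOf) =>
        allOf.All(w => ContainsWord(text, w)) &&
        anyOf.Any(w => ContainsWord(text, w));
}
```

With this helper, KeywordQuery.Matches(text, new[] { "Paris", "Jackson", "Nepal" }, new[] { "Trophy", "Award" }) returns true only when all three required words and at least one of the two optional words occur in the text. Because the check runs on the text of the whole document, the keywords may be on different pages.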

If you still face problems, please attach your input PDF and expected output here for our reference. We will then provide more information about your query.