Issue in Pdf Search

Hi Team

I am facing an issue with PDF search. Search does not work for text that starts or ends with a special character. Searching for “whipped” returns results, but searching for “whipped!” returns nothing. This happens with every term that ends with a special character. I have attached the PDF file I used for testing. Below is the code I use to search for a word in a PDF.


Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b" + searchKeyword.Trim() + @"\b");
//set text search option to specify regular expression usage
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
//accept the absorber for all the pages
inputPdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
return textFragmentAbsorber.TextFragments;

Hi Navaneethan,


Thanks for contacting support.

I have gone through the PDF file shared earlier, but I am unable to find the string Whipped! in it. However, for the sake of testing, I used the following code snippet, and I do get results with the latest release, Aspose.Pdf for .NET 11.6.0. Can you please share your code snippet so that we can try again to replicate this problem in our environment? We are sorry for the inconvenience.

[C#]

Document inputPdfDocument = new Document("c:/pdftest/doc-cross-doc-hyperlinks.pdf");

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b" + "pdf?" + @"\b");

//set text search option to specify regular expression usage

Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);

textFragmentAbsorber.TextSearchOptions = textSearchOptions;

//accept the absorber for all the pages

inputPdfDocument.Pages.Accept(textFragmentAbsorber);

//get the extracted text fragments

Console.WriteLine(textFragmentAbsorber.TextFragments.Count);

Below is the code snippet

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber(@"(?i)\b" + "whipped!" + @"\b");
//set text search option to specify regular expression usage
Aspose.Pdf.Text.TextOptions.TextSearchOptions textSearchOptions = new Aspose.Pdf.Text.TextOptions.TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
//accept the absorber for all the pages
inputPdfDocument.Pages.Accept(textFragmentAbsorber);
//get the extracted text fragments
return textFragmentAbsorber.TextFragments;

Hi Navaneethan,


Thanks for sharing the details.

I have tested the scenario using the code above and I am unable to find any instance of the keyword whipped!. I also tried searching for the same keyword in Adobe Reader and could not find any instance of it there either. Please take a look at the attached image file.

Sorry team, I attached the wrong file. I have now attached the correct file to my first post.

Hi Navaneethan,


Thanks for sharing the updated document.

I have tested the scenario and managed to reproduce the same problem. I have logged it as PDFNEWNET-40864 in our issue tracking system. We will look further into the details of this problem and keep you posted on the status of the fix. Please be patient and allow us a little time. We are sorry for this inconvenience.

Hi Team,
Can you please let me know the status of this issue? Our client is following up on it.

Thanks,
Navaneethan V

Hi Navaneethan,


Thanks for your patience.

We do understand the criticality of this problem, but I am afraid the earlier reported issues are not yet resolved. The product team will review and investigate them as per their development schedule, and as soon as we have further updates, we will let you know.

We are sorry for this delay and inconvenience.

Hi Navaneethan,

Thanks for your patience.

We have further investigated the earlier reported issue, and it appears to be expected behavior rather than a bug.

The point is that the expression ‘\b’ means ‘a zero-width boundary between a word-class character (a letter, digit, or underscore) and either a non-word-class character (space, punctuation) or an edge of the text’. The ‘!’ character is not a word-class character; it is a punctuation character.

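Therefore the regex ‘(?i)\bwhipped\b’ matches the word ‘whipped’ in the text ‘This man has been mercilessly whipped!’. But the regex ‘(?i)\bwhipped!\b’ matches nothing there: the trailing ‘\b’ would require a word-class character directly after the ‘!’, and the text ends at that point.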
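The boundary behavior described above can be checked outside Aspose.Pdf with plain .NET regular expressions. The following is a minimal sketch using System.Text.RegularExpressions; the class name BoundaryDemo and its members are made up for illustration, and it assumes that the absorber's regular expression mode follows the same .NET rules, which has not been verified here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.Pdf.
public static class BoundaryDemo
{
    // Sample sentence taken from the explanation above.
    public const string Sample = "This man has been mercilessly whipped!";

    // Returns true when the pattern matches anywhere in the sample sentence.
    public static bool Matches(string pattern)
    {
        return Regex.IsMatch(Sample, pattern);
    }

    public static void Run()
    {
        Console.WriteLine(Matches(@"(?i)\bwhipped\b"));   // True: boundary between 'd' and '!'
        Console.WriteLine(Matches(@"(?i)\bwhipped!\b"));  // False: no word character after '!'
        Console.WriteLine(Matches(@"(?i)\bwhipped!"));    // True: trailing \b removed
        Console.WriteLine(Matches(@"(?i)\bwhipped\W"));   // True: '!' is a non-word character
    }
}
```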

If you need to find a word together with its trailing punctuation, remove the last word-boundary metacharacter, or replace the ‘zero-width word boundary’ metacharacter with the ‘non-word character’ metacharacter ‘\W’.

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b" + "whipped!");

or

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"(?i)\b" + "whipped" + @"\W");

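It will find ‘whipped!’ in both cases. Note that the second pattern also matches ‘whipped’ followed by any other non-word character, such as a space or a comma.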
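When the keyword comes from user input, as with searchKeyword in the original post, it may start or end with punctuation and may contain regex metacharacters such as ‘?’. One possible approach, sketched below, is to escape the keyword with Regex.Escape and to replace ‘\b’ with lookarounds that only require that no word character touches the term. The class SearchPattern and the method ForKeyword are hypothetical names, and whether TextFragmentAbsorber accepts lookarounds in its regular expression mode is an assumption that should be tested against your Aspose.Pdf version.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Aspose.Pdf.
public static class SearchPattern
{
    // Builds a case-insensitive "whole term" pattern for an arbitrary keyword.
    // Regex.Escape neutralises metacharacters such as '?' or '(' in the keyword.
    // The lookarounds stand in for \b so the term may start or end with punctuation:
    //   (?<!\w)  the term is not preceded by a word character
    //   (?!\w)   the term is not followed by a word character
    public static string ForKeyword(string searchKeyword)
    {
        string escaped = Regex.Escape(searchKeyword.Trim());
        return @"(?i)(?<!\w)" + escaped + @"(?!\w)";
    }
}
```

With plain .NET Regex, SearchPattern.ForKeyword("whipped!") matches in ‘This man has been mercilessly whipped!’ but not inside ‘bullwhipped!’. The resulting string would be passed to the TextFragmentAbsorber constructor in place of the concatenated ‘\b’ pattern, with TextSearchOptions(true) set as in the snippets above.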

Please refer to https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx for more detailed information about regular expressions.