Hi Team,
We are trying to search for text in a PDF, and the text contains a "." character. We are using a regular expression to search for the string. The search string exists in the PDF, but the TextFragmentAbsorber returns zero matches. Please find the code we used below.
TextSearchOptions op1 = new TextSearchOptions(true);
op1.UseFontEngineEncoding = true;
op1.IgnoreResourceFontErrors = true;
op1.SearchForTextRelatedGraphics = false;
op1.IgnoreShadowText = true;
// op1.IsRegularExpressionUsed = true;
string searchText = citation.CitationText.Trim();
searchText = searchText.Replace("(", @"\(").Replace(")", @"\)");
searchText = searchText.Replace("[", @"\[").Replace("]", @"\]");
searchText = searchText.Replace("?", @"\?").Replace("+", @"\+");
searchText = searchText.Replace("$", @"\$");
searchText = searchText.Replace("-", @"\-").Replace("^", @"\^");
searchText = searchText.Replace("{", @"\{").Replace("}", @"\}");
searchText = searchText.Replace("|", @"\|").Replace("*", @"\*");
searchText = searchText.Replace(" ", @"\s").Replace("\n", @"\s*");
searchText = searchText.Replace(".", @"\.");
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(searchText, op1);
document.Pages[citation.ReportPageNumber ?? 0].Accept(textFragmentAbsorber);
if (textFragmentAbsorber.TextFragments.Count() == 0)
{
Console.WriteLine("Not found");
}
@CRAsposeUser
Cause
The issue you're encountering when searching for text that contains a period (.) in a PDF document using Aspose.PDF may be related to how the regular expression is constructed or how the search options are configured. The period is a special character in regular expressions: it matches any character unless it is escaped.

Solution
To ensure that the search string is interpreted correctly, you should enable the use of regular expressions in your TextSearchOptions and ensure that the period is properly escaped. Additionally, consider enabling the dotall mode if you want the period to match newline characters as well. Here's how you can modify your code:

TextSearchOptions op1 = new TextSearchOptions(true); // Enable regular expressions
op1.UseFontEngineEncoding = true;
op1.IgnoreResourceFontErrors = true;
op1.SearchForTextRelatedGraphics = false;
op1.IgnoreShadowText = true;
op1.setDotallMode(true); // Enable dotall mode if needed
string searchText = citation.CitationText.Trim();
searchText = Regex.Escape(searchText); // Escape all special characters in the search text
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(searchText, op1);
document.Pages[citation.ReportPageNumber ?? 0].Accept(textFragmentAbsorber);
if (textFragmentAbsorber.TextFragments.Count() == 0)
{
Console.WriteLine("Not found");
}
Next Steps
1. Ensure that the citation.CitationText variable contains the exact text you are trying to search for, including the period.
2. Test the modified code to see if it successfully finds the text in the PDF.
3. If the issue persists, check the PDF content to ensure that the text is not represented in a different format or encoding.

Please implement these changes and let me know if you need further assistance!
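As a rough illustration of the escaping step, here is a minimal sketch that uses only the standard System.Text.RegularExpressions API, not Aspose.PDF. The CitationPattern class and its Build method are hypothetical names introduced for this example. One thing to be aware of is that Regex.Escape also escapes white space, so a pattern built with it alone would require the exact same spaces and line breaks as the source text. The sketch therefore escapes each word separately and joins the pieces with \s+.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.PDF.
public static class CitationPattern
{
    // Builds a pattern that matches the text literally, except that any run
    // of white space in the text matches any run of white space in the input.
    public static string Build(string text)
    {
        // Split the trimmed text on white-space runs.
        string[] parts = Regex.Split(text.Trim(), @"\s+");

        // Escape regex metacharacters (such as ".") in each piece, then
        // rejoin the pieces with a flexible white-space matcher.
        return string.Join(@"\s+", parts.Select(p => Regex.Escape(p)));
    }
}
```

For example, CitationPattern.Build("Smith v. Jones") returns Smith\s+v\.\s+Jones, which matches "Smith v. Jones" even when a line break separates the words, but does not match "Smith vx Jones". The resulting string could then be passed to TextFragmentAbsorber together with the TextSearchOptions shown above. How Aspose.PDF treats \s+ across line breaks in a particular document is an assumption that should be checked against your PDF.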
But the method is not available in .NET. Can you please confirm again?
@CRAsposeUser
Would you please share some more details, such as which method you are looking for to search the text? Also, please share your sample PDF document along with the text that you want to extract. We will test the scenario in our environment and address it accordingly.