I am trying to find and highlight keywords with Aspose.PDF for .NET using TextFragmentAbsorber. Everything works well except when the word contains the letter combination “fl”. Examples: rifle, kerfuffle, fleet, flicked. I have also found that the combination “fi” is not found either.
The function I am using:
public void HighlightPdfFileTest(string fileName, string searchtext)
{
    Aspose.Pdf.License licHighlightText = new Aspose.Pdf.License();
    licHighlightText.SetLicense("Aspose.Pdf.lic");
    // Load an existing PDF file in which you want to highlight text
    Document doc = new Document(ConfigurationManager.AppSettings["doc_dir"] + fileName);
    // Get the number of pages
    int numofpages = doc.Pages.Count();
    // Get each word of the phrase
    string[] words = searchtext.Split(' ');
    // Loop through all the words
    int totalwords = words.Count();
    for (int wordcount = 0; wordcount < totalwords; wordcount++)
    {
        if (words[wordcount].Length > 2)
        {
            for (int page = 1; page <= numofpages; page++)
            {
                // Search target text to highlight
                TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)\\b" + words[wordcount] + "\\b", new TextSearchOptions(true));
                //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(searchtext);
                //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)" + words[wordcount], new TextSearchOptions(true));
                //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[^a-z0-9]" + words[wordcount] + "[^a-z0-9]", new TextSearchOptions(true));
                doc.Pages[page].Accept(textFragmentAbsorber);
                int instancecount = textFragmentAbsorber.TextFragments.Count();
                if (instancecount > 0)
                {
                    for (int i = 1; i <= instancecount; i++)
                    {
                        // Create a highlight annotation
                        HighlightAnnotation ha = new HighlightAnnotation(doc.Pages[page], textFragmentAbsorber.TextFragments[i].Rectangle);
                        // Specify highlight color
                        ha.Color = Aspose.Pdf.Color.Yellow;
                        // Add annotation to highlight text in PDF
                        doc.Pages[page].Annotations.Add(ha);
                    }
                }
            }
        }
    }
    // Save the document
    doc.Save(ConfigurationManager.AppSettings["doc_dir"] + "\\testdocwhithlight.pdf");
}
Various documents with the issue:
Fails when searching for:
rifle => BuickPDF.pdf
rifle => Romantics copy2PDF.pdf
kerfuffle => Emily DickinsonPDF.pdf
fleet => Defence of Fort McHenryPDF.pdf
flute => Hamlen BrookPDF3.pdf
flicked => Hamlen BrookPDF3.pdf
Works for:
immaculate => BuickPDF.pdf
apprehensive => Romantics copy2PDF.pdf
scholarly kerfuffle => Emily DickinsonPDF.pdf (finds scholarly only)
BuickPDF.pdf (43.2 KB)
Defence of Fort McHenryPDF.pdf (53.8 KB)
Emily DickinsonPDF.pdf (79.0 KB)
Hamlen BrookPDF3.pdf (61.1 KB)
Romantics copy2PDF.pdf (43.0 KB)
@bernieferencak
What is the searchtext variable in the above code snippet?
As I understand it, you are using regular expressions to search for the required text?
In that case the search is set up differently. Example code looks like this:
Document pdfDocument = new Document("input.pdf");
string regexPattern = @"YourRegularExpressionHere";
// Pass the pattern to the absorber so that it is actually used for the search
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regexPattern);
// The constructor argument 'true' enables regular-expression search
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
pdfDocument.Pages.Accept(textFragmentAbsorber);
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
Console.WriteLine("Found text: " + textFragment.Text);
}
Console.WriteLine("Text search complete.");
These are the words searched and the corresponding documents which I attached.
Various documents with the issue:
Fails when searching for:
rifle => BuickPDF.pdf
rifle => Romantics copy2PDF.pdf
kerfuffle => Emily DickinsonPDF.pdf
fleet => Defence of Fort McHenryPDF.pdf
flute => Hamlen BrookPDF3.pdf
flicked => Hamlen BrookPDF3.pdf
Works for:
immaculate => BuickPDF.pdf
apprehensive => Romantics copy2PDF.pdf
scholarly kerfuffle => Emily DickinsonPDF.pdf (finds scholarly only)
I simplified the code. I took out the regex and searched one document for several words. It finds all the words unless they contain “fl”.
string searchtext = "flames Dickinson into fly scholarly kerfuffle";
Aspose.Pdf.License licHighlightText = new Aspose.Pdf.License();
licHighlightText.SetLicense("Aspose.Pdf.lic");
// Load an existing PDF file in which you want to highlight text
Document doc = new Document("Emily DickinsonPDF.pdf");
// Get the number of pages
int numofpages = doc.Pages.Count();
// Get each word of the phrase
string[] words = searchtext.Split(' ');
// Loop through all the words
int totalwords = words.Count();
for (int wordcount = 0; wordcount < totalwords; wordcount++)
{
    if (words[wordcount].Length > 2)
    {
        for (int page = 1; page <= numofpages; page++)
        {
            // Search target text to highlight
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(words[wordcount]);
            doc.Pages[page].Accept(textFragmentAbsorber);
            int instancecount = textFragmentAbsorber.TextFragments.Count();
            if (instancecount > 0)
            {
                for (int i = 1; i <= instancecount; i++)
                {
                    // Create a highlight annotation
                    HighlightAnnotation ha = new HighlightAnnotation(doc.Pages[page], textFragmentAbsorber.TextFragments[i].Rectangle);
                    // Specify highlight color
                    ha.Color = Aspose.Pdf.Color.Yellow;
                    // Add annotation to highlight text in PDF
                    doc.Pages[page].Annotations.Add(ha);
                }
            }
        }
    }
}
// Save the document
doc.Save("testdocwhithlight.pdf");
@bernieferencak
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-55258
You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.
@bernieferencak
Thanks for the data and the simplified code. I understand the issue now; I reproduced the problem and created a task for the development team.
(The task uses the simpler code below and notes that fuller details can be found in this forum thread.)
var doc = new Document(myDir + "Emily DickinsonPDF.pdf");
var textFragmentAbsorber = new TextFragmentAbsorber("flames"); // or "fly"
doc.Pages.Accept(textFragmentAbsorber);
int instancecount = textFragmentAbsorber.TextFragments.Count;
Console.WriteLine($"instancecount = {instancecount}"); // instancecount == 0
Any ideas on this? I need to implement a solution ASAP and was going to purchase a version for my client to use, but if this will not work I will have to go in another direction.
Thanks
@bernieferencak
The development team has responded about this problem.
We investigated the issue and found no problems with the documents or our library. Words containing “fl” cannot be found because the fonts in the documents use MacRomanEncoding, and the text does not contain the separate characters ‘f’ (102) and ‘l’ (108). Instead, those words contain the single ligature character ‘ﬂ’ (U+FB02, decimal 64258). To find them, search with this single Unicode character 64258 (‘ﬂ’) instead of the separate 102 (‘f’) and 108 (‘l’) character codes.
I’ve attached a code snippet that finds words containing the ﬂ ligature.
var doc = new Document(input);
var searchPhrase = "\uFB02ames"; // "ﬂames" spelled with the single ligature character U+FB02 (64258)
var textFragmentAbsorber = new TextFragmentAbsorber(searchPhrase);
doc.Pages.Accept(textFragmentAbsorber);
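For the regex-based search used in the first post, the same idea could be expressed as an alternation that accepts either spelling. The following is only a minimal sketch: `LigaturePattern` and `Build` are hypothetical names, the code is checked here against .NET's own `Regex` class only, and it is an assumption that the absorber's regular-expression mode treats the ligature characters the same way. It also covers “fi” (U+FB01, 64257), which the original question reported as failing too.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): builds a whole-word,
// case-insensitive pattern in which every "fl" or "fi" pair also matches
// the corresponding single ligature character.
public static class LigaturePattern
{
    public static string Build(string word)
    {
        var pattern = new StringBuilder("(?i)\\b");
        int i = 0;
        while (i < word.Length)
        {
            // Look at the current two-character pair, if there is one.
            string pair = i + 1 < word.Length ? word.Substring(i, 2) : "";
            if (string.Equals(pair, "fl", StringComparison.OrdinalIgnoreCase))
            {
                pattern.Append("(?:fl|\uFB02)"); // "fl" or the ligature U+FB02 (64258)
                i += 2;
            }
            else if (string.Equals(pair, "fi", StringComparison.OrdinalIgnoreCase))
            {
                pattern.Append("(?:fi|\uFB01)"); // "fi" or the ligature U+FB01 (64257)
                i += 2;
            }
            else
            {
                // Escape everything else so user input cannot inject regex syntax.
                pattern.Append(Regex.Escape(word[i].ToString()));
                i += 1;
            }
        }
        return pattern.Append("\\b").ToString();
    }
}
```

The resulting string would take the place of `"(?i)\\b" + words[wordcount] + "\\b"` in the original function, still passed together with `new TextSearchOptions(true)`.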
Would you have an example of using the TextFragmentAbsorber to search for unicode characters? I cannot find that in any of the documentation.
var searchPhrase = new string(new[] { (char)64258, (char)97, (char)109, (char)101, (char)115});
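var codes = "\uFB02ames".ToCharArray(); // ligature spelling, 5 chars: 64258, 97, 109, 101, 115
var codes1 = "flames".ToCharArray(); // plain spelling, 6 chars: 102, 108, 97, 109, 101, 115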
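Building the search phrase from character codes by hand does not scale to arbitrary keywords, so one option is to derive the ligature spelling from the plain word. This is a sketch under assumptions: `LigatureSearch`, `WithLigatures` and `Variants` are hypothetical names, and only the “fi” (U+FB01) and “fl” (U+FB02) ligatures are handled, on the assumption that the documents' fonts use no other ligatures.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.PDF): produces the spellings of a
// keyword that may appear in a PDF whose text uses fi/fl ligature characters.
public static class LigatureSearch
{
    private const string LigatureFi = "\uFB01"; // decimal 64257
    private const string LigatureFl = "\uFB02"; // decimal 64258

    // Replaces every lowercase "fl" and "fi" pair with its single ligature character.
    public static string WithLigatures(string word)
    {
        return word.Replace("fl", LigatureFl).Replace("fi", LigatureFi);
    }

    // Returns the plain word, plus the ligature spelling when it differs.
    public static List<string> Variants(string word)
    {
        var variants = new List<string> { word };
        string ligated = WithLigatures(word);
        if (ligated != word)
        {
            variants.Add(ligated);
        }
        return variants;
    }
}
```

Each string returned by `Variants(words[wordcount])` could then be passed to `new TextFragmentAbsorber(...)` in the highlighting loop, so that a keyword such as "flames" is searched in both spellings.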