TextFragmentAbsorber not finding words that contain "fl"

bernieferencak · August 9, 2023, 12:41pm

I am trying to find and highlight keywords with TextFragmentAbsorber in Aspose.PDF for .NET. Everything works well except when the word contains the letter combination “fl”. Examples: rifle, kerfuffle, fleet, flicked. I have also found that the combination “fi” is not found.

sergei.shibanov · August 9, 2023, 2:23pm

@bernieferencak
Please provide the document and code snippet you are using.

bernieferencak · August 9, 2023, 3:18pm

The function I am using:

public void HighlightPdfFileTest(string fileName, string searchtext)
{

  Aspose.Pdf.License licHighlightText = new Aspose.Pdf.License();
  licHighlightText.SetLicense("Aspose.Pdf.lic");

  // Load an existing PDF file in which you want to highlight text
  Document doc = new Document(ConfigurationManager.AppSettings["doc_dir"] + fileName);

  //Get the number of pages
  int numofpages = doc.Pages.Count();

  //get each word of the phrase'
  string[] words = searchtext.Split(' ');

  //Loop through all the words
  int totalwords = words.Count();

  for (int wordcount = 0; wordcount < totalwords; wordcount++)
  {
      if (words[wordcount].Length > 2)
      {
          for (int page = 1; page <= numofpages; page++)
          {

              // Search target text to highlight
              TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)\\b" + words[wordcount] + "\\b", new TextSearchOptions(true));
              //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(searchtext);
              //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("(?i)" + words[wordcount], new TextSearchOptions(true));
              //TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("[^a-z0-9]" + words[wordcount] + "[^a-z0-9]", new TextSearchOptions(true));

              doc.Pages[page].Accept(textFragmentAbsorber);
              int instancecount = textFragmentAbsorber.TextFragments.Count();
              if (instancecount > 0)
              {
                  for (int i = 1; i <= instancecount; i++)
                  {
                      // Create a highlight annotation
                      HighlightAnnotation ha = new HighlightAnnotation(doc.Pages[page], textFragmentAbsorber.TextFragments[i].Rectangle);

                      // Specify highlight color 
                      ha.Color = Aspose.Pdf.Color.Yellow;

                      // Add annotation to highlight text in PDF 
                      doc.Pages[page].Annotations.Add(ha);

                  }
              }
          }
      }
  }

  // Save the document 
  doc.Save(ConfigurationManager.AppSettings["doc_dir"] + "\\testdocwhithlight.pdf");

}

Various documents with the issue:

Fails when searching for:
rifle => BuickPDF.pdf
rifle => Romantics copy2PDF.pdf
kerfuffle => Emily DickinsonPDF.pdf
fleet => Defence of Fort McHenryPDF.pdf
flute => Hamlen BrookPDF3.pdf
flicked => Hamlen BrookPDF3.pdf

Works for:

immaculate => BuickPDF.pdf
apprehensive => Romantics copy2PDF.pdf
scholarly kerfuffle => Emily DickinsonPDF.pdf (finds scholarly only)

BuickPDF.pdf (43.2 KB)
Defence of Fort McHenryPDF.pdf (53.8 KB)
Emily DickinsonPDF.pdf (79.0 KB)
Hamlen BrookPDF3.pdf (61.1 KB)
Romantics copy2PDF.pdf (43.0 KB)

sergei.shibanov · August 9, 2023, 4:19pm

@bernieferencak
What is the value of the searchtext variable in the code snippet above?
As I understand it, you are using regular expressions to search for the required text?
That is done differently; example code looks like this:

Document pdfDocument = new Document("input.pdf");
string regexPattern = @"YourRegularExpressionHere";

// Passing true to TextSearchOptions enables regular-expression search
TextSearchOptions textSearchOptions = new TextSearchOptions(true);

// Hand the pattern and the options to the absorber so the pattern is actually used
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(regexPattern, textSearchOptions);
pdfDocument.Pages.Accept(textFragmentAbsorber);

foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    Console.WriteLine("Found text: " + textFragment.Text);
}

Console.WriteLine("Text search complete.");

bernieferencak · August 9, 2023, 4:32pm

These are the words searched and the corresponding documents which I attached.

Various documents with the issue:

Fails when searching for:
rifle => BuickPDF.pdf
rifle => Romantics copy2PDF.pdf
kerfuffle => Emily DickinsonPDF.pdf
fleet => Defence of Fort McHenryPDF.pdf
flute => Hamlen BrookPDF3.pdf
flicked => Hamlen BrookPDF3.pdf

Works for:

immaculate => BuickPDF.pdf
apprehensive => Romantics copy2PDF.pdf
scholarly kerfuffle => Emily DickinsonPDF.pdf (finds scholarly only)

bernieferencak · August 9, 2023, 6:13pm

I simplified the code. I took out the regex and searched one document for several words. It finds all the words unless they contain “fl”.

searchtext = "flames Dickinson into fly scholarly kerfuffle";

Aspose.Pdf.License licHighlightText = new Aspose.Pdf.License();
licHighlightText.SetLicense("Aspose.Pdf.lic");

// Load an existing PDF file in which you want to highlight text
Document doc = new Document("Emily DickinsonPDF.pdf");

//Get the number of pages
int numofpages = doc.Pages.Count();

//get each word of the phrase
string[] words = searchtext.Split(' ');

//Loop through all the words
int totalwords = words.Count();

for (int wordcount = 0; wordcount < totalwords; wordcount++)
{
    if (words[wordcount].Length > 2)
    {
        for (int page = 1; page <= numofpages; page++)
        {
            // Search target text to highlight
            TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(words[wordcount]);

            doc.Pages[page].Accept(textFragmentAbsorber);
            int instancecount = textFragmentAbsorber.TextFragments.Count();
            if (instancecount > 0)
            {
                for (int i = 1; i <= instancecount; i++)
                {
                    // Create a highlight annotation
                    HighlightAnnotation ha = new HighlightAnnotation(doc.Pages[page], textFragmentAbsorber.TextFragments[i].Rectangle);

                    // Specify highlight color
                    ha.Color = Aspose.Pdf.Color.Yellow;

                    // Add annotation to highlight text in PDF
                    doc.Pages[page].Annotations.Add(ha);
                }
            }
        }
    }
}

// Save the document
doc.Save("testdocwhithlight.pdf");

sergei.shibanov · August 10, 2023, 7:26am

@bernieferencak
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55258

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

sergei.shibanov · August 10, 2023, 7:30am

@bernieferencak
Thanks for the data and the simplified code. I reproduced the problem and created a task for the development team.
(The task uses the simpler code below and notes that fuller comments can be found on the forum.)

var doc = new Document(myDir + "Emily DickinsonPDF.pdf");
var textFragmentAbsorber = new TextFragmentAbsorber("flames");  // or "fly"
doc.Pages.Accept(textFragmentAbsorber);
int instancecount = textFragmentAbsorber.TextFragments.Count;
Console.WriteLine($"instancecount = {instancecount}");  // instancecount == 0

bernieferencak · October 25, 2023, 2:26pm

Any ideas on this? I need to implement a solution ASAP and was going to purchase a version for my client to use but if this will not work I will have to go in another direction.

Thanks

sergei.shibanov · October 25, 2023, 4:01pm

@bernieferencak
I will check the status with the development team and write to you.

sergei.shibanov · November 1, 2023, 7:32am

@bernieferencak
The development team wrote about this problem.

We investigated the issue and found no problems with the document or our library. Words containing “fl” cannot be found because the fonts in these documents use MacRomanEncoding, and the words do not contain the separate characters ‘f’ (102) and ‘l’ (108). Instead, they contain the single ligature character ‘ﬂ’ (64258). To search for such words, use this single Unicode character 64258 (‘ﬂ’) instead of the separate character codes 102 (‘f’) and 108 (‘l’).
I’ve attached a code snippet that finds words containing ﬂ.

var doc = new Document(input);
var searchPhrase = "ﬂames";
var textFragmentAbsorber = new TextFragmentAbsorber(searchPhrase);
doc.Pages.Accept(textFragmentAbsorber);
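
When the keywords come from user input, one possible way to apply this is to generate ligature spellings of each word and search for every spelling in turn. The sketch below is only an illustration under some assumptions: LigatureVariants and its For method are hypothetical names, not part of Aspose.PDF, and the table assumes a document might use any of the Latin ligature characters U+FB00 to U+FB04. Only ‘ﬂ’ (U+FB02) is confirmed for the documents in this thread, and “fi” was reported as not found as well.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.PDF): builds ligature spellings of a word.
public static class LigatureVariants
{
    // Plain letter sequences and the single ligature characters that may replace them.
    // Longest sequences come first so "ffl" is handled before "ff" and "fl".
    private static readonly string[,] Map =
    {
        { "ffi", "\uFB03" },
        { "ffl", "\uFB04" },
        { "ff",  "\uFB00" },
        { "fi",  "\uFB01" },
        { "fl",  "\uFB02" }
    };

    // Returns the original word followed by its distinct ligature spellings.
    public static List<string> For(string word)
    {
        var variants = new List<string> { word };

        // Spelling with every substitution applied in order, longest first.
        string greedy = word;
        for (int i = 0; i < Map.GetLength(0); i++)
            greedy = greedy.Replace(Map[i, 0], Map[i, 1]);
        AddIfNew(variants, greedy);

        // Spellings with each substitution applied on its own.
        for (int i = 0; i < Map.GetLength(0); i++)
            AddIfNew(variants, word.Replace(Map[i, 0], Map[i, 1]));

        return variants;
    }

    private static void AddIfNew(List<string> list, string s)
    {
        if (!list.Contains(s)) list.Add(s);
    }

    public static void Main()
    {
        foreach (string w in new[] { "flames", "kerfuffle" })
            Console.WriteLine($"{w}: {For(w).Count} variants");
        // flames: 2 variants
        // kerfuffle: 4 variants
    }
}
```

Each returned spelling could then be passed to new TextFragmentAbsorber(...) exactly as in the snippet above. Note that the replacement is case-sensitive, so a capitalized word such as "Fleet" would need to be lowercased first or handled separately.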

bernieferencak · January 11, 2024, 3:38pm

Would you have an example of using the TextFragmentAbsorber to search for unicode characters? I cannot find that in any of the documentation.

sergei.shibanov · January 11, 2024, 4:06pm

@bernieferencak

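// "ﬂames" built from character codes: 64258 is the single 'ﬂ' ligature character
var searchPhrase = new string(new[] { (char)64258, (char)97, (char)109, (char)101, (char)115});

// For comparison: "flames" typed with separate 'f' and 'l' starts with codes 102, 108
var codes = "flames".ToCharArray();
// "ﬂames" typed with the ligature starts with the single code 64258
var codes1 = "ﬂames".ToCharArray();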
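
To make the difference between those two arrays visible, the character codes can be printed side by side. This is a small sketch using only standard .NET calls; LigatureCodes and Codes are illustrative names. The compatibility normalization (FormKC) at the end is an addition not mentioned above: it maps the ligature U+FB02 back to the plain letters ‘f’ and ‘l’, which may be useful when comparing text taken from a PDF against keywords typed by a user.

```csharp
using System;
using System.Linq;
using System.Text;

public static class LigatureCodes
{
    // Formats each UTF-16 code unit of the string as a decimal number.
    public static string Codes(string s) =>
        string.Join(" ", s.Select(c => ((int)c).ToString()));

    public static void Main()
    {
        string plain = "flames";          // separate 'f' and 'l'
        string ligature = "\uFB02ames";   // single 'ﬂ' ligature character

        Console.WriteLine(Codes(plain));     // 102 108 97 109 101 115
        Console.WriteLine(Codes(ligature));  // 64258 97 109 101 115

        // Compatibility normalization replaces the ligature with "fl".
        string folded = ligature.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(folded == plain);  // True
    }
}
```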