TextFragmentAbsorber doesn't find paragraph

charlie.lancaster · February 20, 2023, 4:20pm

Hey,

We are using the Pdf.Text.TextFragmentAbsorber to search and replace text. When we pass in a whole paragraph, it doesn’t find the text, but it does work when we pass the paragraph in as single lines.

I have included the code we are using below. We are on Aspose.PDF for .NET 22.10.

// Create the Aspose PDF document
Aspose.Pdf.Document pdfDoc = new Aspose.Pdf.Document(message.Documents[0].FilePath);

foreach (MessageDocumentPhrase phrase in message.Documents[0].Phrases)
{
  try
  {
    // Create the required objects
    TextFragmentAbsorber textFragmentAbsorber = (phrase.IsExpression)
      ? new TextFragmentAbsorber(new System.Text.RegularExpressions.Regex(phrase.SearchText), new Aspose.Pdf.Text.TextSearchOptions(true)) // Is a regular expression
      : new TextFragmentAbsorber(Regex.Escape(phrase.SearchText), new Aspose.Pdf.Text.TextSearchOptions(true)); // Not a regular expression

    // Process the search
    pdfDoc.Pages.Accept(textFragmentAbsorber);

    TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;

    if (textFragments.Count != 0)
    {
      foreach (TextFragment textFragment in textFragments)
      {
        try
        {
          // Update text and other properties
          textFragment.Text = phrase.ReplacementText;
          textFragment.TextState.RenderingMode = TextRenderingMode.FillText;
          textFragment.TextState.Font = FontRepository.FindFont(phrase.Font);
          textFragment.TextState.FontSize = phrase.FontSize;
          textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.ColorTranslator.FromHtml(phrase.FontColour));
          textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.ColorTranslator.FromHtml(phrase.BackgroundColour));
          textFragment.TextState.Underline = phrase.Underline;
          if (phrase.WordSpacing != 0)
            textFragment.TextState.WordSpacing = phrase.WordSpacing;
          if (phrase.LineSpacing != 0)
            textFragment.TextState.LineSpacing = phrase.LineSpacing;
        }
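        catch (Exception inner)
        {
          // Keep the original exception as InnerException so the root cause is not lost
          throw new Exception($"Unable to apply text replacement: {phrase.ReplacementText}", inner);
        }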
      }
    }
  }
  catch (Exception ex)
  {
    _telemetry.TrackException(ex);
  }
}

// Apply the changes
pdfDoc.Save(message.Documents[0].FilePath);

carlos.molina · February 20, 2023, 6:27pm

@charlie.lancaster,

Can you please attach the document that is giving you this issue?

charlie.lancaster · February 21, 2023, 9:21am

Here’s the document in question. The paragraph was:

“Our goal is to build an inclusive positive culture where everyone can feel comfortable being themselves, empowering our people to create their own high standards and therefore more value. We work together to promote fairness while recognising, valuing and embracing differences –providing a transparent support structure and generous training budget to help our people develop skills to progress their career. Our region also supports a hybrid model which can flex across a wide spectrum of working options determined by our business, customer and individual needs.”

Supporrt33418SearchAndReplaceParagraphs.pdf (331.0 KB)

carlos.molina · February 21, 2023, 2:28pm

@charlie.lancaster,

I wrote some code in C# and Java that finds your paragraph and adds some annotations to it.

Keep in mind that there was a problem in the string you were searching for: it was missing a space after the dash.

Here is the code in C#:

private void Logic(Document doc)
{
    string paragraphContent = "Our goal is to build an inclusive positive culture where everyone can feel comfortable being themselves, empowering our people to create their own high standards and therefore more value. We work together to promote fairness while recognising, valuing and embracing differences – providing a transparent support structure and generous training budget to help our people develop skills to progress their career. Our region also supports a hybrid model which can flex across a wide spectrum of working options determined by our business, customer and individual needs.";
    string searchableContent = Regex.Replace(paragraphContent, " ", @"\s+");
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(searchableContent, new TextSearchOptions(true));
    doc.Pages.Accept(absorber);

    foreach (var fragment in absorber.TextFragments)
    {
        string fragmentText = fragment.Text; // Just to see the content

        foreach (var textsegment in fragment.Segments)
        {
            string segmentText = textsegment.Text; // Just to see the content

            // Underline each text segment of the match
            UnderlineAnnotation underline = new UnderlineAnnotation(fragment.Page, textsegment.Rectangle);
            underline.Color = Color.Red;
            fragment.Page.Annotations.Add(underline);
        }

        // Highlight the whole matched fragment
        HighlightAnnotation highlight = new HighlightAnnotation(fragment.Page, fragment.Rectangle);
        highlight.Color = Color.YellowGreen;
        fragment.Page.Annotations.Add(highlight);
    }

    doc.Save($"{PartialPath}_output.pdf");
}

Here is the code in Java:

public void Logic(Document doc) throws Exception
{
    String paragraphContent = "Our goal is to build an inclusive positive culture where everyone can feel comfortable being themselves, empowering our people to create their own high standards and therefore more value. We work together to promote fairness while recognising, valuing and embracing differences – providing a transparent support structure and generous training budget to help our people develop skills to progress their career. Our region also supports a hybrid model which can flex across a wide spectrum of working options determined by our business, customer and individual needs.";
    String searchableContent = paragraphContent.replace(" ", "\\s+");
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(searchableContent, new TextSearchOptions(true));
    doc.getPages().accept(absorber);

    for (var fragment : absorber.getTextFragments())
    {
        String fragmentText = fragment.getText(); // Just to see the content

        for (var segment : fragment.getSegments())
        {
            String segmentText = segment.getText(); // Just to see the content

            // Underline each text segment of the match
            UnderlineAnnotation underline = new UnderlineAnnotation(fragment.getPage(), segment.getRectangle());
            underline.setColor(Color.getRed());
            fragment.getPage().getAnnotations().add(underline);
        }

        // Highlight the whole matched fragment
        HighlightAnnotation highlight = new HighlightAnnotation(fragment.getPage(), fragment.getRectangle());
        highlight.setColor(Color.getYellowGreen());
        fragment.getPage().getAnnotations().add(highlight);
    }

    doc.save(PartialPath + "_output.pdf");
}

Here are the input and output for the Java version:
FindParagraph_input.pdf (331.0 KB)
FindParagraph_output.pdf (348.3 KB)

charlie.lancaster · February 22, 2023, 11:02am

I’ve just updated the search string and the function, and I can now correctly find the paragraph. However, the output isn’t correct, as there is some text floating off the page. Is there any way around this? I have attached the output below for reference, along with the options we are passing through:

Font: “Arial”
FontSize: 12
FontColour: “#000000”
BackgroundColour: “#FFFFFF”
Underline: false
RenderingMode: FillText

Thanks,

Charlie

searchandreplace-boundarieserror.pdf (406.1 KB)

carlos.molina · February 22, 2023, 12:26pm

@charlie.lancaster,

Can you please share your code snippet?

charlie.lancaster · February 22, 2023, 1:02pm

It’s still the same code as above; I’ve only changed how the input string is passed to the TextFragmentAbsorber (it wouldn’t find the paragraph when we used the Regex.Escape() method).

So I am using the code below. (I am aware this will cause issues with regex-based searching, but I am just trying to get to the bottom of this bug before I refactor.)

string searchableContent = Regex.Replace(phrase.SearchText, " ", @"\s+");


// Create the required objects
TextFragmentAbsorber textFragmentAbsorber = (phrase.IsExpression)
  ? new TextFragmentAbsorber(new System.Text.RegularExpressions.Regex(phrase.SearchText), new Aspose.Pdf.Text.TextSearchOptions(true)) // Is a regular expression
  : new TextFragmentAbsorber(searchableContent, new Aspose.Pdf.Text.TextSearchOptions(true)); // Not a regular expression
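For reference, Regex.Escape() also escapes white space, so the escaped pattern only matches a literal single space between words; that is presumably why the paragraph was not found when it wraps across several lines in the PDF. One way to keep both the escaping and the flexible whitespace is to escape each word separately and join the words with \s+. Below is a minimal, library-free sketch of that idea in Java (to match the Java version earlier in the thread). The class and method names are hypothetical and not part of Aspose.PDF, and how the absorber treats the resulting pattern is an assumption based on the snippets in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Aspose.PDF. */
final class LiteralParagraphPattern {

    // Regex metacharacters that need a backslash to be matched literally.
    private static final Pattern META = Pattern.compile("[\\\\.\\[\\]{}()*+\\-?^$|#]");

    /**
     * Escapes every word of the literal text and joins the words with \s+,
     * so that any run of spaces or line breaks between words still matches.
     */
    static String toWhitespaceTolerantRegex(String literalText) {
        List<String> escapedWords = new ArrayList<>();
        for (String word : literalText.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Prefix each metacharacter with a backslash.
                escapedWords.add(META.matcher(word).replaceAll("\\\\$0"));
            }
        }
        return String.join("\\s+", escapedWords);
    }

    public static void main(String[] args) {
        String regex = toWhitespaceTolerantRegex("more value. We work together (as a team)");
        String wrapped = "more value. We\nwork   together (as a team)";
        // The pattern still matches when the text is wrapped or has extra spaces.
        System.out.println(Pattern.compile(regex).matcher(wrapped).find());
    }
}
```

The same approach carries over to C#: split the search text on whitespace, call Regex.Escape() on each word, and join the results with @"\s+". Unlike a plain replacement of " " with \s+, characters such as "." or "(" in the user's text are then matched literally.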

carlos.molina · February 22, 2023, 1:05pm

@charlie.lancaster,

I didn’t mean how you find the text, but what you do with it once it is found. The font size of the text seems different, which is not the case in my example, so I am guessing you are doing something different that may be causing the issue. That is why I am asking for the code.

Can you share the whole code snippet, please?

The text you found now looks like this:

While the original text looked like this:

charlie.lancaster · February 22, 2023, 1:43pm

The objective of this function is to allow users to search and replace text within a PDF document. The output I provided in my last post was generated from the code in my first post (with the Regex.Escape() call replaced), with the searched paragraph being overwritten by the replacement text.

In terms of the styling choices, these are just what the user has passed through to us. The sizing isn’t the issue here (as we can just tell the user to lower their font sizing). The issue is that the text doesn’t seem to respect the PDF’s margins and floats off of the page.

carlos.molina · February 22, 2023, 2:30pm

@charlie.lancaster,

The size really does matter: when you edit an existing fragment, the new text has a specific rectangle to fit into. If it doesn’t fit, it overflows the width, which produces the issue you are seeing. I am trying to figure out a way to avoid this and will post soon.
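To make the rectangle constraint concrete, here is a rough, library-free sketch in Java that estimates whether replacement text at a given font size would fit into the rectangle of the original fragment. It is not how Aspose.PDF lays out text. The class name, the method names, and both factors are assumptions for illustration: an average glyph width of half the font size (a crude heuristic for a proportional font) and a line height of 1.2 times the font size. Widths and heights are taken to be in points.

```java
/** Hypothetical estimate for illustration only; not part of Aspose.PDF. */
final class ReplacementFitEstimate {

    // Assumed average glyph width as a fraction of the font size.
    static final double AVG_GLYPH_WIDTH_FACTOR = 0.5;
    // Assumed line height as a multiple of the font size.
    static final double LINE_HEIGHT_FACTOR = 1.2;

    /** Greedy word wrap: counts the lines needed at the given width. */
    static int estimateLines(String text, double fontSize, double rectWidth) {
        double glyphWidth = fontSize * AVG_GLYPH_WIDTH_FACTOR;
        int charsPerLine = Math.max(1, (int) Math.floor(rectWidth / glyphWidth));
        int lines = 1;
        int used = 0; // characters already placed on the current line
        for (String word : text.trim().split("\\s+")) {
            int needed = (used == 0) ? word.length() : used + 1 + word.length();
            if (needed > charsPerLine && used > 0) {
                lines++;               // the word does not fit, start a new line
                used = word.length();
            } else {
                used = needed;
            }
        }
        return lines;
    }

    /** True if the estimated wrapped text is no taller than the rectangle. */
    static boolean fits(String text, double fontSize, double rectWidth, double rectHeight) {
        double neededHeight = estimateLines(text, fontSize, rectWidth) * fontSize * LINE_HEIGHT_FACTOR;
        return neededHeight <= rectHeight;
    }
}
```

For example, under these assumptions the text "aaaa bbbb cccc" needs two lines in a 50 pt wide rectangle at font size 10, but three lines at font size 20, so a rectangle sized for the original text can be too small for the same words at a larger size.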

carlos.molina · February 22, 2023, 3:44pm

@charlie.lancaster
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-53737

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

charlie.lancaster · June 21, 2023, 9:06am

Hey Carlos,

Are there any updates on this issue? We have customers waiting for this.

asad.ali · June 21, 2023, 6:49pm

@charlie.lancaster

We are afraid that the earlier logged ticket has not yet been resolved. Please note that it has already been escalated to the maximum priority, and as soon as we make some progress towards its resolution, we will share it with you. Please spare us some time.

We are sorry for the inconvenience.

aspose.notifier · November 19, 2023, 11:01pm

The issues you have found earlier (filed as PDFNET-53737) have been fixed in Aspose.PDF for .NET 23.11.