Multi Line Highlight Gets Extra Text

The C# code below gets the inner text of a highlight annotation in a PDF, but if the highlight spans two lines it picks up extra text that is not needed. For example, if I highlight text at the end of one line and the beginning of the following line, it seems to grab all of the text on the first line that sits above the highlighted part of the second line, and all of the text on the second line that sits below the highlighted part of the first line. It appears to grab everything inside a single rectangle. Is there any workaround to get this to work?

if (currentAnnotation is HighlightAnnotation highlightAnnotation)
{
    var commentText = currentAnnotation.Contents;
    // Create a TextAbsorber to extract text fragments that intersect with the annotation's rectangle
    TextAbsorber absorber = new TextAbsorber();
    absorber.TextSearchOptions = new TextSearchOptions(true); // Enable regular expression search
    absorber.TextSearchOptions.Rectangle = highlightAnnotation.Rect;

    // Accept the absorber on the page to get the text fragments
    page.Accept(absorber);

    // Get the extracted text from the absorber
    var innerText = absorber.Text;
    var replaceTextAnnotation = new ReplaceTextAnnotation
    {
        InnerText = "Inner Text: " + innerText,
        Contents = "Comment Text: " + commentText,
        AnnotationType = currentAnnotation.GetType().ToString()
    };
    annotations.Add(replaceTextAnnotation);
}

@three30

Would you kindly share your sample PDF document for our reference as well? We will test the scenario in our environment and address it accordingly.

HiglightTest.pdf (55.5 KB)
Here is the full code:
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Facades;
using Aspose.Pdf.Text;

SetLicenseExample();
Document document = new Document("filepath.pdf");
PdfAnnotationEditor annotationEditor = new PdfAnnotationEditor();
annotationEditor.BindPdf(document);
var annotationTypes = new[] { AnnotationType.StrikeOut, AnnotationType.Highlight, AnnotationType.FreeText, AnnotationType.Caret, AnnotationType.Text };
var annotations = new List<ReplaceTextAnnotation>();
foreach (Page page in document.Pages)
{
    // Extract only the annotations that belong to the current page
    var pageAnnotations = annotationEditor.ExtractAnnotations(page.Number, page.Number, annotationTypes);

    for (int i = 0; i < pageAnnotations.Count; i++)
    {
        var currentAnnotation = pageAnnotations[i];
        if (currentAnnotation is HighlightAnnotation highlightAnnotation)
        {
            var commentText = currentAnnotation.Contents;
            // Create a TextAbsorber to extract text fragments that intersect with the annotation's rectangle
            TextAbsorber absorber = new TextAbsorber();
            absorber.TextSearchOptions = new TextSearchOptions(true); // Enable regular expression search
            absorber.TextSearchOptions.Rectangle = highlightAnnotation.Rect;

            // Accept the absorber on the page to get the text fragments
            page.Accept(absorber);

            // Get the extracted text from the absorber
            var innerText = absorber.Text;
            var replaceTextAnnotation = new ReplaceTextAnnotation
            {
                InnerText = "Inner Text: " + innerText,
                Contents = "Comment Text: " + commentText,
                AnnotationType = currentAnnotation.GetType().ToString()
            };
            Console.WriteLine(innerText);
            //Console.WriteLine(commentText);

            annotations.Add(replaceTextAnnotation);
        }
    }
}

@three30

We were unable to resolve ReplaceTextAnnotation in your code snippet. Can you please share which API version you are using? Can you also share the missing definitions so that we can proceed with the testing?

I'm using the Aspose.PDF NuGet package 23.7.0, license version 3.0. Let me know if there is any more information you need.

Thank you in advance

@three30

We were unable to resolve the ReplaceTextAnnotation class. Can you please share its definition, or its name with the complete namespace?

public class ReplaceTextAnnotation
{
    public string Contents { get; set; }
    public string InnerText { get; set; }
    public string AnnotationType { get; set; }
    // Add any other properties you want to store for the ReplaceText annotation.
}

I apologize, I did not realize that I had not included that. Thank you again.

@three30

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55155

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

Hi @asad.ali. Thank you again for taking the time to look at this issue. Were you able to reproduce the problem? Also, do you have an ETA for when it may be resolved?

@three30

Yes, we were able to reproduce the issue in our environment while using Aspose.PDF for .NET 23.7, so a ticket has been logged for its rectification. We are afraid that we do not have any ETA at the moment because the ticket has not yet been investigated. We will look into its details on a first-come, first-served basis and let you know once we make some progress in this regard. Please give us some time.

We are sorry for the inconvenience.
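
A note on the geometry behind the symptom, in case it helps while the ticket is open. In the PDF format a highlight annotation stores one overall rectangle plus a list of quad points, four points per highlighted run of text. For a highlight that wraps onto a second line, the overall rectangle is the bounding box of both quads, so it also covers the unhighlighted parts of both lines, which matches what is described in the first post. The sketch below is plain C# with no Aspose dependency; `Pt`, `Box`, `HighlightGeometry` and the coordinates are all made up for illustration. It shows how the quad points could be split into one box per line. Whether your Aspose.PDF version exposes the quad points (it appears to have a `QuadPoints` property on text markup annotations) is an assumption to verify, and this has not been tested against HiglightTest.pdf.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for a PDF library's point and rectangle types.
public struct Pt
{
    public double X, Y;
    public Pt(double x, double y) { X = x; Y = y; }
}

public struct Box
{
    public double Llx, Lly, Urx, Ury; // lower-left and upper-right corners
    public Box(double llx, double lly, double urx, double ury)
    { Llx = llx; Lly = lly; Urx = urx; Ury = ury; }

    public bool Contains(Pt p) => p.X >= Llx && p.X <= Urx && p.Y >= Lly && p.Y <= Ury;
}

public static class HighlightGeometry
{
    // Axis-aligned bounds of a set of points; min/max makes the point order irrelevant.
    public static Box Bounds(IEnumerable<Pt> points)
    {
        var list = points.ToList();
        return new Box(list.Min(p => p.X), list.Min(p => p.Y),
                       list.Max(p => p.X), list.Max(p => p.Y));
    }

    // Every four consecutive quad points describe one highlighted run (typically one line).
    public static List<Box> PerLineBoxes(IReadOnlyList<Pt> quadPoints)
    {
        if (quadPoints.Count % 4 != 0)
            throw new ArgumentException("Expected a multiple of four points.", nameof(quadPoints));

        var boxes = new List<Box>();
        for (int i = 0; i < quadPoints.Count; i += 4)
            boxes.Add(Bounds(quadPoints.Skip(i).Take(4)));
        return boxes;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up coordinates: end of line 1 (x 400..500) and start of line 2 (x 50..150).
        var quads = new[]
        {
            new Pt(400, 712), new Pt(500, 712), new Pt(400, 700), new Pt(500, 700),
            new Pt(50, 698),  new Pt(150, 698), new Pt(50, 686),  new Pt(150, 686),
        };

        Box whole = HighlightGeometry.Bounds(quads);
        List<Box> lines = HighlightGeometry.PerLineBoxes(quads);

        // A word on line 2 that sits to the right of the highlighted part.
        var stray = new Pt(300, 692);
        Console.WriteLine(whole.Contains(stray));             // True: the overall rectangle covers it
        Console.WriteLine(lines.Any(b => b.Contains(stray))); // False: no per-line box covers it
    }
}

If the quad points are available, one possible workaround would be to run the absorber once per box, assigning each box to TextSearchOptions.Rectangle in turn, and concatenate the results. That is only a sketch of an approach and not a fix for PDFNET-55155.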