Multi Line Highlight Gets Extra Text

three30 · July 21, 2023, 12:54pm

The C# code below gets the inner text of a highlight annotation in a PDF, but if the highlight spans two lines it picks up extra text that is not needed. For example, if I highlight text at the end of one line and the beginning of the following line, it seems to grab all of the text on the second line that sits below the highlighted part of the first line, and all of the text on the first line that sits above the highlighted part of the second line. It appears to grab text from a single rectangle. Is there any workaround to get this to work?

if (currentAnnotation is HighlightAnnotation highlightAnnotation)
{
    var commentText = currentAnnotation.Contents;
    // Create a TextAbsorber to extract text fragments that intersect with the annotation's rectangle
    TextAbsorber absorber = new TextAbsorber();
    absorber.TextSearchOptions = new TextSearchOptions(true); // Enable regular expression search
    absorber.TextSearchOptions.Rectangle = highlightAnnotation.Rect;

    // Accept the absorber on the page to get the text fragments
    page.Accept(absorber);

    // Get the extracted text from the absorber
    var innerText = absorber.Text;
    var replaceTextAnnotation = new ReplaceTextAnnotation
    {
        InnerText = "Inner Text: " + innerText,
        Contents = "Comment Text: " + commentText,
        AnnotationType = currentAnnotation.GetType().ToString()
    };
    annotations.Add(replaceTextAnnotation);
}
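
To illustrate the behavior described above: in the PDF format, a text markup annotation such as a highlight carries a QuadPoints entry with one quadrilateral per highlighted line, while its Rect is a single box enclosing all of them. The sketch below is plain C# with no Aspose.PDF calls. The Box type, the HighlightGeometry helper, and every coordinate are hypothetical and serve only to show why a search restricted to the enclosing rectangle also covers the unhighlighted parts of both lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical axis-aligned box in PDF user space (origin at the bottom-left).
public readonly struct Box
{
    public double Llx { get; }
    public double Lly { get; }
    public double Urx { get; }
    public double Ury { get; }

    public Box(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }

    // True when the point lies inside the box or on its border.
    public bool Contains(double x, double y) =>
        x >= Llx && x <= Urx && y >= Lly && y <= Ury;
}

public static class HighlightGeometry
{
    // Smallest box enclosing every per-line box; this plays the role of the annotation's Rect.
    public static Box Enclosing(IReadOnlyCollection<Box> lines)
    {
        if (lines.Count == 0)
            throw new ArgumentException("At least one box is required.", nameof(lines));
        return new Box(lines.Min(b => b.Llx), lines.Min(b => b.Lly),
                       lines.Max(b => b.Urx), lines.Max(b => b.Ury));
    }

    // True only when the point lies inside one of the per-line boxes.
    public static bool InAnyLine(IReadOnlyCollection<Box> lines, double x, double y) =>
        lines.Any(b => b.Contains(x, y));
}

public static class Example
{
    public static void Main()
    {
        // Assumed layout: text runs from x=72 to x=540; the highlight covers
        // the end of the first line and the start of the second line.
        var lines = new[]
        {
            new Box(400, 700, 540, 712), // end of first line
            new Box(72, 686, 200, 698)   // start of second line
        };
        Box rect = HighlightGeometry.Enclosing(lines);

        // A point at the start of the first line, which is NOT highlighted.
        Console.WriteLine(rect.Contains(100, 706));                      // True
        Console.WriteLine(HighlightGeometry.InAnyLine(lines, 100, 706)); // False
    }
}
```

The enclosing box spans the full width of both lines, so text at the start of the first line and at the end of the second line falls inside it even though neither is highlighted. Restricting extraction to each per-line box in turn, rather than to Rect, is one possible direction for a workaround; this thread does not confirm whether it resolves the problem in Aspose.PDF.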

asad.ali · July 21, 2023, 4:27pm

@three30

Would you kindly share your sample PDF document for our reference as well? We will test the scenario in our environment and address it accordingly.

three30 · July 21, 2023, 6:43pm

HiglightTest.pdf (55.5 KB)
Here is the full code:
using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Facades;
using Aspose.Pdf.Text;

SetLicenseExample();
Document document = new Document("filepath.pdf");
PdfAnnotationEditor annotationEditor = new PdfAnnotationEditor();
annotationEditor.BindPdf(document);
var annotationTypes = new[] { AnnotationType.StrikeOut, AnnotationType.Highlight, AnnotationType.FreeText, AnnotationType.Caret, AnnotationType.Text };
var annotations = new List<ReplaceTextAnnotation>();
foreach (Page page in document.Pages)
{
    var pageAnnotations = annotationEditor.ExtractAnnotations(page.Number, page.Number, annotationTypes);

    for (int i = 0; i < pageAnnotations.Count; i++)
    {
        var currentAnnotation = pageAnnotations[i];
        if (currentAnnotation is HighlightAnnotation highlightAnnotation)
        {
            var commentText = currentAnnotation.Contents;
            // Create a TextAbsorber to extract text fragments that intersect with the annotation's rectangle
            TextAbsorber absorber = new TextAbsorber();
            absorber.TextSearchOptions = new TextSearchOptions(true); // Enable regular expression search
            absorber.TextSearchOptions.Rectangle = highlightAnnotation.Rect;

            // Accept the absorber on the page to get the text fragments
            page.Accept(absorber);

            // Get the extracted text from the absorber
            var innerText = absorber.Text;
            var replaceTextAnnotation = new ReplaceTextAnnotation
            {
                InnerText = "Inner Text: " + innerText,
                Contents = "Comment Text: " + commentText,
                AnnotationType = currentAnnotation.GetType().ToString()
            };
            Console.WriteLine(innerText);
            //Console.WriteLine(commentText);

            annotations.Add(replaceTextAnnotation);
        }
    }
}

asad.ali · July 21, 2023, 10:02pm

@three30

We were unable to resolve ReplaceTextAnnotation in your code snippet. Can you please share which API version you are using? Can you also share the missing definitions so that we can proceed with the testing?

three30 · July 24, 2023, 2:06pm

I'm using the Aspose.PDF NuGet package 23.7.0, license version 3.0. Let me know if there is any more information you need.
Not all of the code ended up in the code snippet; some of it is part of the original comment.

three30 · July 24, 2023, 2:07pm

using System;
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Facades;
using Aspose.Pdf.Text;

SetLicenseExample();
Document document = new Document("filepath.pdf");
PdfAnnotationEditor annotationEditor = new PdfAnnotationEditor();
annotationEditor.BindPdf(document);
var annotationTypes = new[] { AnnotationType.StrikeOut, AnnotationType.Highlight, AnnotationType.FreeText, AnnotationType.Caret, AnnotationType.Text };
var annotations = new List<ReplaceTextAnnotation>();
foreach (Page page in document.Pages)
{
    var pageAnnotations = annotationEditor.ExtractAnnotations(page.Number, page.Number, annotationTypes);

    for (int i = 0; i < pageAnnotations.Count; i++)
    {
        var currentAnnotation = pageAnnotations[i];
        if (currentAnnotation is HighlightAnnotation highlightAnnotation)
        {
            var commentText = currentAnnotation.Contents;
            // Create a TextAbsorber to extract text fragments that intersect with the annotation's rectangle
            TextAbsorber absorber = new TextAbsorber();
            absorber.TextSearchOptions = new TextSearchOptions(true); // Enable regular expression search
            absorber.TextSearchOptions.Rectangle = highlightAnnotation.Rect;

            // Accept the absorber on the page to get the text fragments
            page.Accept(absorber);

            // Get the extracted text from the absorber
            var innerText = absorber.Text;
            var replaceTextAnnotation = new ReplaceTextAnnotation
            {
                InnerText = "Inner Text: " + innerText,
                Contents = "Comment Text: " + commentText,
                AnnotationType = currentAnnotation.GetType().ToString()
            };
            Console.WriteLine(innerText);
            //Console.WriteLine(commentText);

            annotations.Add(replaceTextAnnotation);
        }
    }
}

three30 · July 24, 2023, 2:08pm

The most recent comment contains the code.

three30 · July 24, 2023, 2:09pm

Thank you in advance

asad.ali · July 24, 2023, 9:51pm

@three30

We were unable to resolve the ReplaceTextAnnotation class. Can you please share its definition or its name with the complete namespace?

three30 · July 25, 2023, 12:40pm

public class ReplaceTextAnnotation
{
    public string Contents { get; set; }
    public string InnerText { get; set; }
    public string AnnotationType { get; set; }
    // Add any other properties you want to store for the ReplaceText annotation.
}

three30 · July 25, 2023, 12:40pm

I apologize, I did not realize that I had not included that. Thank you again.

asad.ali · July 25, 2023, 7:17pm

@three30

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55155

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

three30 · July 26, 2023, 7:35pm

Hi @asad.ali. Thank you again for taking the time to look at this issue. Were you able to reproduce the problem? Also, do you have an ETA for when it may be resolved?

asad.ali · July 26, 2023, 9:10pm

@three30

Yes, we were able to reproduce the issue in our environment while using Aspose.PDF for .NET 23.7, which is why a ticket has been logged for it. We are afraid that we do not have any ETA at the moment because the ticket has not yet been investigated. We will look into its details on a first-come, first-served basis and let you know once we make some progress in this regard. Please give us some time.

We are sorry for the inconvenience.

three30 · July 8, 2024, 2:25pm

We wanted to follow up to see whether this is still an open issue or whether it has been resolved in a recent update to the software. If it is still an issue, do you have an ETA for the resolution?

asad.ali · July 8, 2024, 9:36pm

@three30

We are afraid that the earlier logged ticket has not been resolved yet due to other issues in the queue. Nevertheless, your concerns have been recorded and we will inform you once we make definite progress towards resolving the ticket. We highly appreciate your patience and understanding in this regard. We apologize for the inconvenience.