Identify a paragraph, compare two strings, and highlight the change in a PDF

Hi,
We have the input {“CurrentText” : “How are you doing”, “Modified Text” : “What are you doing”}.

Comparing the two parameters above, “How” has been changed to “What”.

In my PDF I need to find the “CurrentText” value, compare the two strings, highlight only the changed word, and add a comment describing the change.
The attached file shows what the output should look like for the input above.
ASposeSample.pdf (12.5 KB)

I appreciate any help I can get.

Thanks,
Kamal

@tejkamalleo

Please read the following articles about finding text in a PDF and highlighting it:
Search and Get Text from Pages of PDF
PDF Highlights Annotation using C#

The following code example shows how to find the text and highlight it.

using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// dataDir is the folder that contains the input PDF
// Open document
Document pdfDocument = new Document(dataDir + "ASposeSample.pdf");

// Create a TextFragmentAbsorber to find all instances of the search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("How are you doing");
// Accept the absorber for the first page only
pdfDocument.Pages[1].Accept(textFragmentAbsorber);
// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (TextFragment textFragment in textFragmentCollection)
{
    // Highlight the whole matched fragment in yellow
    HighlightAnnotation ha = new HighlightAnnotation(pdfDocument.Pages[1], textFragment.Rectangle);
    ha.Color = Aspose.Pdf.Color.Yellow;
    pdfDocument.Pages[1].Annotations.Add(ha);
}

pdfDocument.Save(dataDir + "output.pdf");

Thank you so much for the update, Tahir. I have already tried this. The problem is that I need to compare the “current text” with the “modified text”, identify the difference, and highlight and comment on only the word that differs.
So in the example above I need to highlight only “How”, not the entire sentence.
Could you please have a look at the PDF attached to my previous post?

@tejkamalleo

Aspose.PDF does not provide APIs to compare the text of paragraphs. We have logged a feature request for comparing PDF text as PDFNET-51550 in our issue tracking system. You will be notified via this forum thread once this feature is available. We apologize for the inconvenience.
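
Until such a feature exists, the comparison itself can be done outside Aspose.PDF. The following is only a minimal sketch, not an Aspose API: the `WordDiff` class and its `ChangedWords` method are hypothetical names introduced here. It splits both strings on whitespace and uses a longest-common-subsequence table to find the words of the current text that have no counterpart in the modified text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.PDF): word-level diff of two strings.
public static class WordDiff
{
    // Returns the words of `current` that are not matched in `modified`.
    public static List<string> ChangedWords(string current, string modified)
    {
        var separators = new[] { ' ', '\t', '\r', '\n' };
        string[] a = current.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] b = modified.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..]
        var lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
            for (int j = b.Length - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table; words of `a` that are skipped are the changed ones.
        var changed = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) changed.Add(a[x++]);
            else y++;
        }
        while (x < a.Length) changed.Add(a[x++]);
        return changed;
    }
}
```

For the input in this thread, `WordDiff.ChangedWords("How are you doing", "What are you doing")` returns a list containing only `"How"`. Each returned word could then be used as a search phrase for highlighting, with the caveat that a word may occur more than once on a page, so the match would still need to be narrowed to the paragraph in question.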

Thanks for the update, @tahir.manzoor. For now I will have to highlight the entire content, then.
On another note, what if the content to highlight spans more than one line in the PDF? I think TextFragmentAbsorber only finds content that sits on a single line, right?
Could you please help me with this?

@tejkamalleo

Please share the sample input and expected output PDF files here for our reference. We will then update the issue according to your requirement.

@tahir.manzoor

Please find the sample attached. The content I want to highlight is “Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s”.
This content is dynamic and can change. It may appear on one line or span several lines, and the input text can differ as well.
Source.pdf (29.3 KB)

@tejkamalleo

As per our understanding, you want to compare text of paragraph.

In your last post, you shared that you want to highlight the text.

To ensure a timely and accurate response, please share the complete details of your use case along with test scenarios. We will then provide more information about your query.

Sure, let me give a proper example.
Input Parameter : Adobe® Portable Document Format (PDF) is a universal file format that preserves all
of the fonts, formatting, colours and graphics of any source document, regardless of
the application and platform used to create it

In the attached PDF I need to find the text above and highlight it as shown in the attached output.
Input.pdf (9.3 KB)
Output.pdf (15.1 KB)

@tejkamalleo

Please check the code example (AddHighlightAnnotationAdvanced) in the article shared in my earlier post.

Thanks for the update, @tahir.manzoor, but it could not find text that spans multiple lines. In the example I attached above, I am searching for an entire paragraph.
Could you please point me to an example that searches for content spanning more than one line?

@tejkamalleo

You can use the following code example to find multi-line text and highlight it. However, this code does not work for your document. We have logged this problem in our issue tracking system as PDFNET-51557. You will be notified via this forum thread once this issue is resolved. We apologize for the inconvenience.

        /// <summary>
        /// Advanced example: highlight a text fragment that spans multiple lines
        /// </summary>
        public static void AddHighlightAnnotationAdvanced()
        {
            var document = new Document(System.IO.Path.Combine(_dataDir, "sample_mod.pdf"));
            var page = document.Pages[1];
            var tfa = new TextFragmentAbsorber(@"Adobe\W+Acrobat\W+Reader", new TextSearchOptions(true));
            tfa.Visit(page);
            foreach (var textFragment in tfa.TextFragments)
            {
                var highlightAnnotation = HighLightTextFragment(page, textFragment, Color.Yellow);
                page.Annotations.Add(highlightAnnotation);
            }
            document.Save(System.IO.Path.Combine(_dataDir, "sample_mod.pdf"));
        }
        private static HighlightAnnotation HighLightTextFragment(Aspose.Pdf.Page page,
            TextFragment textFragment, Color color)
        {
            if (textFragment.Segments.Count == 1)
                return new HighlightAnnotation(page, textFragment.Segments[1].Rectangle)
                {
                    Title = "Aspose User",
                    Color = color,
                    Modified = DateTime.Now,
                    QuadPoints = new Point[]
                    {
                        new Point(textFragment.Segments[1].Rectangle.LLX, textFragment.Segments[1].Rectangle.URY),
                        new Point(textFragment.Segments[1].Rectangle.URX, textFragment.Segments[1].Rectangle.URY),
                        new Point(textFragment.Segments[1].Rectangle.LLX, textFragment.Segments[1].Rectangle.LLY),
                        new Point(textFragment.Segments[1].Rectangle.URX, textFragment.Segments[1].Rectangle.LLY)
                    }
                };

            var offset = 0;
            var quadPoints = new Point[textFragment.Segments.Count * 4];
            foreach (var segment in textFragment.Segments)
            {
                quadPoints[offset + 0] = new Point(segment.Rectangle.LLX, segment.Rectangle.URY);
                quadPoints[offset + 1] = new Point(segment.Rectangle.URX, segment.Rectangle.URY);
                quadPoints[offset + 2] = new Point(segment.Rectangle.LLX, segment.Rectangle.LLY);
                quadPoints[offset + 3] = new Point(segment.Rectangle.URX, segment.Rectangle.LLY);
                offset += 4;
            }

            var llx = quadPoints.Min(pt => pt.X);
            var lly = quadPoints.Min(pt => pt.Y);
            var urx = quadPoints.Max(pt => pt.X);
            var ury = quadPoints.Max(pt => pt.Y);
            return new HighlightAnnotation(page, new Rectangle(llx, lly, urx, ury))
            {
                Title = "Aspose User",
                Color = color,
                Modified = DateTime.Now,
                QuadPoints = quadPoints
            };
        }

        /// <summary>
        /// Print the text marked by each highlight annotation on the first page
        /// </summary>
        public static void GetHighlightedText()
        {
            // Load the PDF file
            Document document = new Document(System.IO.Path.Combine(_dataDir, "sample_mod.pdf"));
            var highlightAnnotations = document.Pages[1].Annotations
                .Where(a => a.AnnotationType == AnnotationType.Highlight)
                .Cast<HighlightAnnotation>();
            foreach (var ta in highlightAnnotations)
            {
                Console.WriteLine($"[{ta.GetMarkedText()}]");
            }
        }
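
The search pattern in AddHighlightAnnotationAdvanced above is hard-coded, while the text in this thread is dynamic. As a rough sketch, such a pattern could be built from the input phrase as shown below. `SearchPattern.FromPhrase` is a hypothetical helper, not an Aspose API, and it joins the words with `\s+` rather than the `\W+` used in the example above, so that punctuation in the phrase still has to match literally. It is an assumption that a pattern built this way matches across line breaks in a given PDF; as noted for PDFNET-51557, the regular expression search did not find the paragraph in the attached document.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): builds a regex from a phrase.
public static class SearchPattern
{
    // Escapes each word and joins the words with \s+, so any run of
    // whitespace in the input (spaces, tabs, line breaks) matches one or
    // more whitespace characters in the searched text.
    public static string FromPhrase(string phrase)
    {
        // A null separator array makes Split use whitespace as the delimiter.
        string[] words = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(@"\s+", words.Select(w => Regex.Escape(w)));
    }
}
```

For example, `SearchPattern.FromPhrase("Adobe Acrobat Reader")` yields `Adobe\s+Acrobat\s+Reader`, which could be passed to `new TextFragmentAbsorber(pattern, new TextSearchOptions(true))` in the same way as the literal pattern in the example.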