Search and Highlight paragraph using Aspose PDF

Hi,
How can I search for a paragraph and highlight it in a PDF document using Aspose.PDF for .NET?

@pooja.jayan

Can you please share your sample PDF document along with the information of the text/paragraph that you want to highlight? We will check it in our environment and share our feedback with you accordingly.

Hi,
Thank you for your response.

I am attaching the PDF document: Sample Document.pdf (168.3 KB)

The document has two pages with a two-column layout. I want to search the document for a paragraph, for example:
“State the thesis or main point in your first few sentences so
your professor will see this part of the answer right away. Use
clear statements that directly answer the question.”
(first sentence on the right side of the second page)
If the document contains that paragraph, it should be highlighted.

I also need to do the same in single-column layout documents.
Thank you.

@pooja.jayan

Please try the code snippet below to add a highlight annotation to your PDF:

// Namespaces used: System.IO, Aspose.Pdf, Aspose.Pdf.Text, Aspose.Pdf.Annotations
Document doc2 = new Document(new FileStream(dataDir + "Sample Document.pdf", FileMode.Open, FileAccess.ReadWrite));
// TextSearchOptions(true) enables regex search; \s+ tolerates line breaks between words, \. matches a literal dot.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"State\s+the\s+thesis\s+or\s+main\s+point\s+in\s+your\s+first\s+few\s+sentences\s+so\s+your\s+professor\s+will\s+see\s+this\s+part\s+of\s+the\s+answer\s+right\s+away\.\s+Use\s+clear\s+statements\s+that\s+directly\s+answer\s+the\s+question\.", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);

// TextFragments is 1-based; highlight every segment of the first match (page 2 is hard-coded here).
foreach (var textsegment in tfa.TextFragments[1].Segments)
{
    HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[2], textsegment.Rectangle);
    ha.Color = Color.FromArgb(255, 196, 212, 167);
    doc2.Pages[2].Annotations.Add(ha);
}
doc2.Save(dataDir + "PDF_Highlighting.pdf");

PDF_Highlighting.pdf (183.5 KB)
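
The pattern in the snippet above is written by hand, with `\s+` between every word. When the paragraph arrives as a plain string, the same kind of pattern could probably be generated instead. The following is only a sketch: `ParagraphPattern` and `Build` are hypothetical names, not part of Aspose.PDF, and the helper uses nothing beyond the .NET `Regex` class.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose.PDF API): turns a plain paragraph
// into a whitespace-tolerant regular expression.
public static class ParagraphPattern
{
    public static string Build(string paragraph)
    {
        // A null separator array splits on any run of whitespace, including line breaks.
        string[] words = paragraph.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Escape each word so that characters such as '.' or '(' match literally,
        // then allow any amount of whitespace between consecutive words.
        return string.Join(@"\s+", words.Select(w => Regex.Escape(w)));
    }
}
```

The returned string should be usable as the first argument of the TextFragmentAbsorber constructor, together with new TextSearchOptions(true), in the same way as the hand-written pattern.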

Hi,
Thank you for your response.

I tried the code you shared. Please also have a look at the following:

  1. The page number cannot be given explicitly, because my only inputs are a paragraph and a document. The program has to search the entire document for the paragraph and, if a match is found, return the document with that paragraph highlighted.

I tried the following code, but every page is getting highlighted:

        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(doc2);
        foreach (PageMarkup markup in absorber.PageMarkups)
        {

            TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"State\s+the\s+thesis\s+or\s+main\s+point\s+in\s+your\s+first\s+few\s+sentences\s+so\s+your\s+professor\s+will\s+see\s+this\s+part\s+of\s+the\s+answer\s+right\s+away.\s+Use\s+clear\s+statements\s+that\s+directly\s+answer\s+the\s+question.", new TextSearchOptions(true));
            doc2.Pages.Accept(tfa);

            foreach (var textsegment in tfa.TextFragments[1].Segments)
            {
                HighlightAnnotation ha = new HighlightAnnotation(doc2.Pages[markup.Number], textsegment.Rectangle);
                ha.Color =Color.FromArgb(255, 196, 212, 167);
                doc2.Pages[markup.Number].Annotations.Add(ha);
            }
        }


        doc2.Save(path + "PDF_with_Highlighted_Text.pdf");

Could you please help me resolve this also?

@pooja.jayan

Please try the code below, which takes the page from the matched text fragment instead of hard-coding the page number:

Document doc2 = new Document(new FileStream(dataDir + "Sample Document.pdf", FileMode.Open, FileAccess.ReadWrite));
// \. matches a literal dot; an unescaped . would match any character.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(@"State\s+the\s+thesis\s+or\s+main\s+point\s+in\s+your\s+first\s+few\s+sentences\s+so\s+your\s+professor\s+will\s+see\s+this\s+part\s+of\s+the\s+answer\s+right\s+away\.\s+Use\s+clear\s+statements\s+that\s+directly\s+answer\s+the\s+question\.", new TextSearchOptions(true));
doc2.Pages.Accept(tfa);

// Take the page from the matched fragment itself (TextFragments is 1-based).
TextFragment match = tfa.TextFragments[1];
foreach (var textsegment in match.Segments)
{
    HighlightAnnotation ha = new HighlightAnnotation(match.Page, textsegment.Rectangle);
    ha.Color = Color.FromArgb(255, 196, 212, 167);
    // Add the annotation to the same page the match was found on, not to a fixed page.
    match.Page.Annotations.Add(ha);
}
doc2.Save(dataDir + "PDF_Highlighting.pdf");

Hi,
Thank you for your response.

Is there any way to match a similar paragraph rather than an exact one? In my case the input comes from an API response, in which the numbering may have changed from "1." to "A." and which sometimes contains an extra space or dot. So I need a solution that does not fail when my input contains an extra character compared with the original paragraph in the document.

Please suggest a solution if possible.

@pooja.jayan

Do you want to use a single regular expression to find several paragraphs that differ slightly from one another? Please let us know whether our understanding is correct, and share any two paragraphs/texts that you want to extract using one regular expression. We will try to prepare an example and share it with you.

Hi,

Thank you for your response.

I don’t want to find several paragraphs that differ slightly from one another. I want to find the one paragraph in the document that is similar to, but slightly different from, my input string.

Here is my document: Sample Document.pdf (168.3 KB)

and this is the input I am giving to my program: “A. Than: comparison →→→ Then: time
The North had more soldiers in the battle
than the South. If the South had more
soldiers, then they would have won.”
If a match is found, that paragraph must be highlighted.

Here the document does contain a matching paragraph, with a slight difference in numbering (A. in the input and 1. in the document), but it is not highlighted, probably because no exact match is found.

So I want to highlight the paragraph even when it differs slightly from the input string (in special characters, dots, and spaces).

Is there any solution to this problem?

Thank you.
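
One possible way to tolerate such differences is to normalize both strings before comparing them: drop a leading list marker, then ignore whitespace, punctuation and letter case. The sketch below only illustrates that idea with plain .NET; `LooseText`, `Normalize` and `LooselyEquals` are hypothetical names, and the marker pattern is an assumption about what the numbering looks like ("1.", "A.", "iv)", "(a)").

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose.PDF API) for comparing two paragraphs loosely.
public static class LooseText
{
    // Assumed shape of a leading list marker: "1.", "A.", "iv)" or "(a)", followed by whitespace.
    private static readonly Regex LeadingMarker = new Regex(@"^\s*\(?[A-Za-z0-9]{1,3}[.)]\s+");

    // Anything that is neither a letter nor a digit (spaces, dots, arrows, colons, ...).
    private static readonly Regex NonWord = new Regex(@"[^\p{L}\p{N}]+");

    public static string Normalize(string text)
    {
        string withoutMarker = LeadingMarker.Replace(text, "");
        return NonWord.Replace(withoutMarker, "").ToLowerInvariant();
    }

    public static bool LooselyEquals(string a, string b)
    {
        return Normalize(a) == Normalize(b);
    }
}
```

With this normalization, "A. Than: comparison" and "1. Than: comparison" compare as equal, because only the letters and digits after the marker are kept. A paragraph that really starts with a very short sentence (for example "Yes. ") would lose that word as well, so the marker pattern may need tightening for real documents.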

@pooja.jayan

Please share a sample code snippet so that we can see how you are passing the above text as input for highlighting. We will try to modify the code to meet your requirements and share it with you.

Hi,
Thank you for your response.

Here is the code I have used:

        string fileName = "SampleDoc.pdf";

        string destFileName = path + fileName;

        //Extracting text from pdf
        Document doc = new Document(new FileStream(destFileName, FileMode.Open, FileAccess.ReadWrite));

        // Giving Input paragraph 
        string markData = "A SaaS APPLICATION ON TEXT MESSAGING " +
                            "SOLUTIONS – A ColdFusion Case Study " +
                            "Executive Summary:" +
                            "A 100 % open rate, 98 % read rate and 90 % response rate is an enviable figure where digital" +
                            "communication is considered.If you haven’t guessed it already the above statistics are for text" +
                            "messages.The proliferation of mobile phones all over the world(it is already 100 % in the USA) has led to" +
                            "many brands opting to include text messaging into their digital marketing strategy.";


        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(doc);

        Regex pattern = new Regex(@"[±–*\t\r\n\s ]|[\n]{2}|(\b[\w\.]\.)|(\b[\w\.]\))|(\([\w\.]\))|(\b[i,v,x,z]{1,3}\))");

        Regex patternForBulleting = new Regex(@"(\s[o]\s)");
        absorber.IsMulticolumnParagraphsAllowed = true;

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            markup.IsMulticolumnParagraphsAllowed = true;
            foreach (MarkupSection section in markup.Sections)
            {
                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    List<string> splittedString = Regex.Split(paragraph.Text, @"(?<=[.–:?])", RegexOptions.IgnorePatternWhitespace).ToList(); ;
                    String paragraphText = paragraph.Text.Replace("\r\no", " ");
                    paragraphText = pattern.Replace(paragraphText, "");
                    String inputString = pattern.Replace(newdata, "");


                    if (splittedString != null)
                    {
                        foreach (var item1 in splittedString)
                        {
                            string text = patternForBulleting.Replace(item1, "");
                            text = pattern.Replace(text, "");
                            if (inputString.Contains(text.ToString()) && text != "" && text != "." && text.Length > 2)
                            {
                                foreach (var line in paragraph.Lines)
                                {
                                    foreach (TextFragment item in line)
                                    {
                                        foreach (var textsegment in item.Segments)
                                        {
                                            HighlightAnnotation ha = new HighlightAnnotation(doc.Pages[markup.Number], textsegment.Rectangle);
                                            ha.Color = Color.Yellow;
                                            doc.Pages[markup.Number].Annotations.Add(ha);

                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        doc.Save(path + "PDF_with_Highlighted_Text.pdf");

The reason I haven’t used TextFragmentAbsorber is that my input paragraph and the document content will not match exactly. (My input comes from an API and will always differ slightly from the actual document content in words, characters, and numbering.)

Here I am attaching the sample document I worked with: SampleDoc.pdf (1.1 MB)

and the paragraph to be highlighted is:
“A SaaS APPLICATION ON TEXT MESSAGING
SOLUTIONS – A ColdFusion Case Study
Executive Summary:
A 100% open rate, 98% read rate and 90% response rate is an enviable figure where digital
communication is considered. If you haven’t guessed it already the above statistics are for text
messages. The proliferation of mobile phones all over the world (it is already 100% in the USA) has led to
many brands opting to include text messaging into their digital marketing strategy.”

My issue is that I don’t want common words or lines to be highlighted on pages other than the one containing the matched paragraph. Currently some words are highlighted on every page.

Please help me find a solution for this.
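
One way to keep stray matches off other pages might be to split the work into two passes: first record every sentence that matched together with its page number, then highlight only on the page that collected the most matched text. The sketch below covers just the selection step in plain .NET; `PageVote` and `BestPage` are hypothetical names, and "most matched characters wins" is an assumed heuristic, not something confirmed in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an Aspose.PDF API): picks the page that holds
// the largest amount of matched text.
public static class PageVote
{
    // hits: one entry per matched sentence, with the page number it was found on.
    // Returns -1 when there are no hits at all.
    public static int BestPage(IEnumerable<(int Page, string Text)> hits)
    {
        return hits
            .GroupBy(h => h.Page)
            // Weight each page by the total length of the text matched on it.
            .Select(g => new { Page = g.Key, Weight = g.Sum(h => h.Text.Length) })
            .OrderByDescending(x => x.Weight)
            .Select(x => x.Page)
            .DefaultIfEmpty(-1)
            .First();
    }
}
```

In the loop above, the call that adds the annotation would then be replaced by recording (markup.Number, text), and the annotations would be added in a second loop only for paragraphs whose markup.Number equals the returned page.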

@pooja.jayan

In your code above, what is the definition of the “newdata” variable?

String inputString = pattern.Replace(newdata, ""); // newdata is missing in the code

Hi,
Thank you for your response.

Please replace newdata with “markData”:
String inputString = pattern.Replace(markData, "");

@pooja.jayan

We are testing the scenario and will get back to you shortly.

@pooja.jayan

It looks like you have been able to achieve your requirement of highlighting paragraphs (as you mentioned in this post). We are responding to you there so that you can discuss the next issue you are facing.