Replace the Highlighted Text

seemadas · June 21, 2017, 1:37pm

I have a requirement to replace the highlighted text in a PDF document. Can you help us with code to replace the highlighted text? Please find the sample attached. We have to replace the yellow highlighted text with new text.

Capture.PNG (5.3 KB)

Thanks,
Seema

This topic was created by asad.ali using the Email to Topic plugin.

asad.ali · June 21, 2017, 4:19pm

Hello Seema,

Thanks for your inquiry.

Would you please share a sample PDF document so that we can test the scenario in our environment and address it accordingly?

Best Regards,
Asad Ali

seemadas · June 22, 2017, 8:56am

SamplePDF.PDF (105.5 KB)
Please find the attached sample PDF document which has highlighted text.

asad.ali · June 22, 2017, 7:04pm

@seemadas

Thanks for sharing input document.

I have checked your document and observed that the text was not highlighted by setting its background color; instead, HighlightAnnotations were used for the purpose. Please note that Aspose.Pdf for .NET can extract text and determine its properties (e.g. font, text size, foreground color, background color). So if the text has a background color, you can match it by iterating through all text fragments absorbed by a TextFragmentAbsorber.
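As a rough sketch of the background-color approach described above (it does not apply to the shared sample, which uses annotations), the fragments can be scanned as follows. The file name input.pdf is a placeholder, and it is an assumption that TextState.BackgroundColor comes back as null for text without a background; please verify this against your own documents.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class BackgroundColorScan
{
    static void Main()
    {
        // Placeholder input path; replace with your own file.
        Document doc = new Document("input.pdf");

        // Absorb every text fragment on every page.
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        doc.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Assumption: fragments without a background report null here.
            Color background = fragment.TextState.BackgroundColor;
            if (!ReferenceEquals(background, null))
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}
```

To match one specific highlight color rather than any background, the color value would have to be compared inside the if block.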

Furthermore, if the PDF document uses HighlightAnnotation to highlight the text, then as a workaround you can extract all highlight annotations from the PDF, get their rectangles, and extract text from those particular rectangles. Please check the following code snippet, which I used to extract the text from your PDF document.

Document doc = new Document(dataDir + "SamplePDF_Highlighted.pdf");
foreach (Aspose.Pdf.Page page in doc.Pages)
{
 foreach(Annotation annot in page.Annotations)
 {
  if(annot is HighlightAnnotation)
  {
   Rectangle searchrectangle = annot.Rect;
   TextFragmentAbsorber tfa = new TextFragmentAbsorber();
   tfa.TextSearchOptions = new TextSearchOptions(searchrectangle);
   page.Accept(tfa);
   foreach(TextFragment tf in tfa.TextFragments)
   {
    Console.WriteLine(tf.Text);
   }
  }
 }
}

Highlighted_Text_Output.png (4.0 KB)

By using the above workaround, you may extract text that is highlighted with HighlightAnnotations from a PDF. If you need any further assistance, please feel free to contact us.

Best Regards,
Asad Ali

seemadas · June 29, 2017, 8:40am

Hi Asad Ali,

Thanks for the code snippet. I have run the code and found that it does not read all of the highlighted text, and in some cases it reads non-highlighted text as well.

With that code I’m getting this text from my SamplePDF document:
MISSISSIPPI
riter: Bankers L
JAN 01 2017

In this case the text ‘riter: Bankers L’ is not highlighted in the PDF document.
The other issue is that there are more than 3 highlighted texts in the given PDF document, but it reads only 3.

Can you please help me get only the highlighted text from the PDF document?

Thanks,
Seema

asad.ali · June 29, 2017, 4:04pm

@seemadas

Thanks for writing back.

I have tested the scenario again and was able to reproduce the incorrect output. However, when I used TextAbsorber instead of TextFragmentAbsorber, the output was correct. Please check the following code snippet and the attached screenshot for your reference.

Document doc = new Document(dataDir + "SamplePDF_Highlighted.pdf");
foreach (Aspose.Pdf.Page page in doc.Pages)
{
 foreach(Annotation annot in page.Annotations)
 {
  if(annot is HighlightAnnotation)
  {
   Rectangle searchrectangle = annot.Rect;
   TextAbsorber tfa = new TextAbsorber();
   tfa.TextSearchOptions.Rectangle = searchrectangle;
   tfa.TextSearchOptions.LimitToPageBounds = true;
   page.Accept(tfa);
   Console.WriteLine(tfa.Text.Trim());
  }
 }
}

Highlighted_Text_Output.png (8.6 KB)

As you can see in the screenshot, all of the highlighted text has been returned by the API. Please make sure that you set the license correctly before extracting text from the PDF document. It seems that you are not setting the license before performing the extraction, which is why only limited content is being returned by the API, due to the trial version limitation.
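For reference, a minimal sketch of applying the license before any extraction is shown below. The license file name Aspose.Pdf.lic is a placeholder; use the path to your own license file.

```csharp
using Aspose.Pdf;

class LicenseSetup
{
    static void Main()
    {
        // Placeholder license file name; point this at your own license file.
        Aspose.Pdf.License license = new Aspose.Pdf.License();
        license.SetLicense("Aspose.Pdf.lic");

        // Load and process documents only after the license has been applied,
        // so that the trial version limits do not truncate the results.
        Document doc = new Document("SamplePDF_Highlighted.pdf");
    }
}
```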

In case of any further assistance, please feel free to contact us.

Best Regards,
Asad Ali

seemadas · July 3, 2017, 8:48am

Hi Asad,

I really appreciate the quick response on my issue. Thank you so much; now I’m getting the correct output.

Thanks,
Seema

asad.ali · July 3, 2017, 10:44am

@seemadas

Thanks for your feedback.

It is good to know that you have managed to get the correct output by using the suggested workaround. Please keep using our API, and in the event of any further inquiry, please feel free to contact us.

Best Regards,
Asad Ali

seemadas · July 4, 2017, 11:26am

I’m trying to replace each annotation with plain text. For each annotation we will have to set dynamic text. I tried the code below, but it is not working as expected.

foreach (Aspose.Pdf.Page page in doc.Pages)
{
 foreach (Annotation annot in page.Annotations)
 {
  if (annot is HighlightAnnotation)
  {
   Rectangle searchrectangle = annot.Rect;
   TextAbsorber tfa = new TextAbsorber();
   tfa.TextSearchOptions.Rectangle = searchrectangle;
   tfa.TextSearchOptions.LimitToPageBounds = true;
   page.Accept(tfa);
   Console.WriteLine(tfa.Text.Trim());

   // replace annotation with plain text
   TextFragmentAbsorber ta = new TextFragmentAbsorber();
   ta.TextSearchOptions = new TextSearchOptions(searchrectangle);
   ta.Visit(doc.Pages[1]);
   foreach (TextFragment tf in ta.TextFragments)
   {
    tf.Text = "Test";
   }
  }
 }
}

Can you please point me to the right API to implement this feature?

asad.ali · July 4, 2017, 7:41pm

@seemadas

Thanks for your inquiry.

Please check the following workaround to replace only the highlighted text in your PDF document. For your reference, I have also attached the output generated by the code below.

Document doc = new Document(dataDir + "SamplePDF_Highlighted.pdf");
List<string> lstTexttoReplace = new List<string>();
foreach (Aspose.Pdf.Page page in doc.Pages)
{
 foreach(Annotation annot in page.Annotations)
 {
  if(annot is HighlightAnnotation)
  {
   Rectangle searchrectangle = annot.Rect;
   TextAbsorber ta = new TextAbsorber();
   ta.TextSearchOptions.Rectangle = searchrectangle;
   ta.TextSearchOptions.LimitToPageBounds = true;
   page.Accept(ta);
   string[] texts = ta.Text.Trim().Split('\n');
   foreach (string s in texts)
   {
    if(!String.IsNullOrEmpty(s.Trim()))
     lstTexttoReplace.Add(s.Trim());
   }
  }
 }
}
foreach (string s in lstTexttoReplace)
{
 TextFragmentAbsorber tfa = new TextFragmentAbsorber(s);
 doc.Pages.Accept(tfa);
 foreach (TextFragment tf in tfa.TextFragments)
 {
  tf.Text = "Test";
 }
}
doc.Save(dataDir + "SamplePDF_Highlighted_Replaced.pdf");

SamplePDF_Highlighted_Replaced.pdf (97.9 KB)

Furthermore, please note that the above is only a workaround, which may or may not give correct results: because the highlight is not stored as the BackgroundColor of the text, we cannot extract the text by that property and must rely on the annotation rectangles instead. So there is a good chance that the output will differ for other documents. You may modify the code snippet as per your requirements.
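One possible modification, given only as a sketch: the code above searches the whole document for each extracted string, so a non-highlighted occurrence of the same string elsewhere would also be replaced, and every match receives the same text. The variant below limits each search to the page that holds the annotation and assigns a different value per highlighted line. The "Value N" replacement text and the output file name are placeholders for illustration; other occurrences of the same string on the same page would still be replaced.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ReplaceHighlightedPerPage
{
    static void Main()
    {
        Document doc = new Document("SamplePDF_Highlighted.pdf");
        int counter = 0;

        foreach (Page page in doc.Pages)
        {
            foreach (Annotation annot in page.Annotations)
            {
                if (!(annot is HighlightAnnotation)) continue;

                // Read the text lying under the highlight rectangle.
                TextAbsorber ta = new TextAbsorber();
                ta.TextSearchOptions.Rectangle = annot.Rect;
                ta.TextSearchOptions.LimitToPageBounds = true;
                page.Accept(ta);

                foreach (string line in ta.Text.Split('\n'))
                {
                    string phrase = line.Trim();
                    if (phrase.Length == 0) continue;

                    // Placeholder for the dynamic text of this highlight.
                    counter++;
                    string replacement = "Value " + counter;

                    // Search only the current page instead of the whole document.
                    TextFragmentAbsorber tfa = new TextFragmentAbsorber(phrase);
                    page.Accept(tfa);
                    foreach (TextFragment tf in tfa.TextFragments)
                    {
                        tf.Text = replacement;
                    }
                }
            }
        }

        // Placeholder output file name.
        doc.Save("SamplePDF_Highlighted_Replaced_PerPage.pdf");
    }
}
```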

In case of any further assistance, please feel free to contact us.

Best Regards,
Asad Ali