How to get Highlighted text from PDF file

Hi Team,


How can I get the highlighted text from a PDF using Aspose.PDF for .NET?

Please share sample code and let me know which version of Aspose.PDF supports this feature.

Regards,
Ganesan. B

Hi Ganesan,


Thanks for contacting support.

Aspose.Pdf for .NET can search for a particular TextFragment in a PDF file and read the properties associated with it (font, size, foreground color, background color, etc.). For your requirement, you can traverse all the TextFragments in the PDF file and determine which ones have a background color. For further details, please see the documentation topic "Search and get Text from all pages using Regular Expression".

If you have any further questions, please feel free to contact us.

Hello,



I am using TextFragment to retrieve all the text in a PDF document, but the
highlighted text does not have its BackgroundColor property set to the
color used for the highlight (yellow).



Here's my code:

// gets all texts
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");

// set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;

// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

List<TextFragment> lst = textFragmentCollection.Cast<TextFragment>()
    .Where(o => o.TextState.BackgroundColor == Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow))
    .ToList();

=> My list is empty, even though I have words highlighted in yellow (using the Adobe Highlight tool). Am I missing something here?



Thank you for your help.



Sincerely,

Lucas




Hi Lucas,


Thanks for contacting support.

Please share the source PDF file so that we can test the scenario on our end. We are sorry for the inconvenience.

Hello,
Please find the PDF file attached.

Thank you for the support.

Hi Lucas,


Thanks for sharing the resource file.

I have tested the scenario and was able to reproduce the problem: background color information for the TextFragment is not being returned. I have logged it in our issue tracking system as PDFNEWNET-37503. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Hi Lucas,


Thanks for your patience. We have looked further into your document, and it is not possible to retrieve the background color of text in the PDF document. A background is usually produced by some other means that is not tied to the text itself (drawing operations or annotations). Your document, for example, does not contain a text background; it contains highlight annotations.


Moreover, please note that the value is not preserved as a text characteristic within the document. The BackgroundColor property of an object can be retrieved only if it was explicitly set beforehand through the BackgroundColor setter for that object.


Please feel free to contact us for any further assistance.


Best Regards,

Hello,

Thank you for the response. I am actually using the Highlight class to retrieve my highlighted text, but I still have a problem because I cannot get the content of the text. In other words, it is impossible to retrieve the text that has been highlighted in a PDF file. Is that correct?

Thank you,

Lucas


Hi Lucas,


Thanks for your inquiry. Yes, it is impossible to get highlighted text if its background was set by an application other than Aspose.Pdf.

However, if the text has been highlighted using an annotation, you can find the annotation and get the text from its rectangle.

Please feel free to contact us for any further assistance.

Best Regards,

Thank you for the response, but I am still not getting it. I am using the HighlightAnnotation type to retrieve my highlighted "rectangles" and that works, but once I have them I cannot retrieve their content (the text).

Thank you very much for your help.

Here's my code:

Document pdfDoc = new Document(originalDocumentName);
for (int y = 1; y <= pdfDoc.Pages.Count; y++)
{
    Page page = pdfDoc.Pages[y];

    List<Aspose.Pdf.InteractiveFeatures.Annotations.Annotation> annotations =
        page.Annotations.Cast<Aspose.Pdf.InteractiveFeatures.Annotations.Annotation>()
            .Where(o => o.Color == Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow)).ToList();

    foreach (Aspose.Pdf.InteractiveFeatures.Annotations.Annotation annotation in annotations)
    {
        if (annotation is HighlightAnnotation)
        {
            Rectangle rect = annotation.Rect;

            // No information about the text or the content itself.
        }
    }
}

Hi Lucas,


Thanks for your inquiry. Please check the following code snippet to get the text of a HighlightAnnotation. Hopefully it will help you accomplish the task.

Document pdfDocument = new Document(myDir + "testAspose.pdf");

foreach (Page aPage in pdfDocument.Pages)
{
    foreach (Aspose.Pdf.InteractiveFeatures.Annotations.Annotation anAnnotation in aPage.Annotations)
    {
        if (anAnnotation is HighlightAnnotation)
        {
            HighlightAnnotation highlightAnno = (HighlightAnnotation)anAnnotation;
            Aspose.Pdf.Rectangle rect = highlightAnno.Rect;

            // create TextAbsorber object restricted to the annotation's rectangle
            TextAbsorber absorber = new TextAbsorber();
            absorber.TextSearchOptions.LimitToPageBounds = true;
            absorber.TextSearchOptions.Rectangle = rect;

            // accept the absorber for the current page
            aPage.Accept(absorber);

            // get the extracted text
            string extractedText = absorber.Text;
            Console.Out.WriteLine("HighlightAnnotation text: {0}", extractedText);
        }
    }
}

pdfDocument.Dispose();

Please feel free to contact us for any further assistance.


Best Regards,

Hi codewarior

How do you highlight text in a PDF file? Do you perhaps have code for that particular function?

@siphosethunogcazi

Thanks for contacting support.

Please check the following code snippet, which searches for text inside a PDF document and highlights it.

Document document = new Document(dataDir + "TestInputPDF.pdf");

// search for the text using a regular expression
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"Garrett\sNevels");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
document.Pages.Accept(textFragmentAbsorber);

TextFragmentCollection textFragmentCollection1 = textFragmentAbsorber.TextFragments;
if (textFragmentCollection1.Count > 0)
{
    foreach (TextFragment textFragment in textFragmentCollection1)
    {
        // build a rectangle covering the matched text fragment
        Aspose.Pdf.Rectangle rect = new Aspose.Pdf.Rectangle(
            textFragment.Position.XIndent,
            textFragment.Position.YIndent,
            textFragment.Position.XIndent + textFragment.Rectangle.Width,
            textFragment.Position.YIndent + textFragment.Rectangle.Height);

        // add a semi-transparent highlight annotation over that rectangle
        Aspose.Pdf.Annotations.HighlightAnnotation highlight =
            new Aspose.Pdf.Annotations.HighlightAnnotation(textFragment.Page, rect);
        highlight.Opacity = 0.5;
        highlight.Color = Aspose.Pdf.Color.FromRgb(0.6, 0.8, 0.98);
        textFragment.Page.Annotations.Add(highlight);
    }
}

document.Save(dataDir + "TestInputPDF_out.pdf");

If you have any further questions, please feel free to let us know.

Hello Asad

I tried the above code, but what I actually meant to ask is how to highlight fields in a PDF file and also add comments to the highlighted fields.

@siphosethunogcazi

Thanks for writing back.

Please clarify whether you are asking about form fields inside a PDF document. Could you also add some more details about your requirements and share a sample PDF with us, so that we can check it on our side and share our feedback?