How to search for multiple text strings or paragraphs in a PDF file and get the rectangle coordinates of the searched text if found

@kranthireddyr

You can search for text in a PDF document using regular expressions. Once the text is found, you can read the Rectangle property of each TextFragment to get its coordinates. Please check the examples in the documentation section below:
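A minimal sketch of this approach, assuming the Aspose.PDF for .NET package is referenced in the project. The file name input.pdf and the two search phrases are placeholders and are not taken from this thread; several phrases are combined into one regular expression with alternation (|).

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class FindTextRectangles
{
    public static void Main()
    {
        // Placeholder input path (assumption); replace with your own file.
        var pdfDocument = new Document("input.pdf");

        // Placeholder phrases (assumption), combined with "|" so that
        // a single absorber finds either of them.
        var absorber = new TextFragmentAbsorber(@"first\s*phrase|second\s*phrase");

        // true = treat the search phrase as a regular expression.
        absorber.TextSearchOptions = new TextSearchOptions(true);

        // Run the search over every page of the document.
        pdfDocument.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Rectangle holds the lower-left (LLX, LLY) and
            // upper-right (URX, URY) corners of the found text.
            Aspose.Pdf.Rectangle rect = fragment.Rectangle;
            Console.WriteLine(
                $"Page {fragment.Page.Number}: \"{fragment.Text}\" " +
                $"LLX={rect.LLX}, LLY={rect.LLY}, URX={rect.URX}, URY={rect.URY}");
        }
    }
}
```

If no fragment is printed, the phrase was not found on any page.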

How do I search for vertically oriented text? Capture2.JPG (15.9 KB)

How can I search for the first two lines of the attached document?

@kranthireddyr

Could you please share the PDF document that contains this text? We will test the scenario in our environment and address it accordingly.

While searching for text, I am getting the error below:
"At most 4 elements (for any collection) can be viewed in evaluation mode."

However, the searched item is present in the PDF, and the search works correctly for some of the test documents.

Document pdfDocument = new Document(_dataDir + "filName.pdf");

TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Search Text");

pdfDocument.Pages.Accept(textFragmentAbsorber);

@kranthireddyr

You are getting this exception because of a trial version limitation: evaluation mode does not allow you to process or view more than 4 elements of any collection, e.g. Paragraphs, TextFragments, Annotations.

Please apply a valid license, or consider getting a free 30-day temporary license, in order to use the API without restrictions and at its full capacity. Please let us know if you face any other issue.

How can we search for text that spans multiple lines? I am attaching a sample PDF, sample.pdf (3.0 KB),
in which I need to search for the text "just for use in the Virtual Mechanics tutorials. More text. And more text. And more text."

@kranthireddyr

You can use a regular expression to match multiline text. Aspose.PDF matches line breaks and spaces with the expression "\s*". Please check the following code snippet, which extracts your particular phrase from the PDF:

Document pdfDocument = new Document(dataDir + "sample.pdf");
foreach (Page page in pdfDocument.Pages)
{
    // \s* matches the spaces and line breaks between words; \. matches a literal period
    var textFragmentAbsorber = new TextFragmentAbsorber(@"just\s*for\s*use\s*in\s*the\s*Virtual\s*Mechanics\s*tutorials\.\s*More\s*text\.\s*And\s*more\s*text\b");
    // true = treat the search phrase as a regular expression
    var textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    page.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;
    // Perform other stuff
}
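Writing such a pattern by hand is error-prone for longer phrases. The following is a minimal sketch of building it from a plain phrase. The helper name SearchPattern.Build is hypothetical and not part of Aspose.PDF; it uses only standard .NET classes. It escapes regex metacharacters such as "." in each word and joins the words with \s*, the separator used in the snippet above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Aspose.PDF): turns a plain phrase
// into a regex that tolerates spaces and line breaks between words.
public static class SearchPattern
{
    public static string Build(string searchText)
    {
        // Split on any whitespace and drop empty entries, so repeated
        // spaces or line breaks in the input do not matter.
        string[] words = searchText.Split(
            (char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Escape each word so characters like '.' are matched literally,
        // then allow optional whitespace between consecutive words.
        return string.Join(@"\s*", words.Select(w => Regex.Escape(w)));
    }
}
```

The result can be passed to the TextFragmentAbsorber constructor in place of the hand-written pattern, with TextSearchOptions(true) set as before.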

Thanks for the response. This works for the provided sample PDF, but there is an issue with rotated text. For the rotated PDF, textFragmentCollection is populated correctly, but the highlight annotation is not applied. Could you please check the sample PDF and try searching for "This is the vertical text for testing purpose. This is mrityunjay from xyz"?

Here is my sample PDF: sample_1_Rotated1.pdf (38.1 KB)

Below is my code:

string _dataDir = @"D:\PDF_Files\";
string searchText = @"This is the vertical text for testing purpose. This is mrityunjay from xyz";
string fileName = "sample_1_Rotated1.pdf";
// Allow any whitespace (including line breaks) between words
string regSearchText = searchText.Replace(" ", @"\s*");
Document pdfDocument = new Document(_dataDir + fileName);
foreach (Page page in pdfDocument.Pages)
{
    // Remember the page rotation and reset it before searching
    Rotation rt = page.Rotate;
    page.Rotate = Rotation.None;

    var textFragmentAbsorber = new TextFragmentAbsorber(regSearchText);
    var textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    page.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;

    // Highlight every match
    foreach (TextFragment textFragment in textFragmentCollection)
    {
        HighlightAnnotation annotation = new HighlightAnnotation(page, textFragment.Rectangle);
        annotation.Color = Color.Yellow;
        annotation.Title = "Team";
        annotation.Contents = "This is test content by Jay";
        page.Annotations.Add(annotation);
    }

    // Restore the original rotation once, even if nothing was found
    page.Rotate = rt;
}

pdfDocument.Save(_dataDir + "Sample_out.pdf");

@kranthireddyr

We added the annotation to the page with the page rotation taken into account, using the following line of code, and obtained the attached output PDF:

page.Annotations.Add(annotation, true);

test20.12.out.pdf (36.8 KB)

Could you please check it and let us know if you still see any issue in it?

Thank you for the response. I am still facing the same issue as before, even after passing true to take the rotation into account. I am using Aspose.PDF version 17.11.0.0; I assume the version is not the problem.

Below is my code. Could you please check whether anything is wrong here?

Input file: sample_1_Rotated1.pdf (38.1 KB)
Output file: test20.12.out.pdf (36.8 KB)
Search text: "This is the vertical text for testing purpose. This is mrityunjay from xyz"
DLL version: Aspose.PDF 17.11.0.0

string _dataDir = @"D:\Plane_PDF_Files\";
string searchText = @"This is the vertical text for testing purpose. This is mrityunjay from xyz";
string fileName = "sample_1_Rotated1.pdf";
// Allow any whitespace (including line breaks) between words
string regSearchText = searchText.Replace(" ", @"\s*");
Document pdfDocument = new Document(_dataDir + fileName);
foreach (Page page in pdfDocument.Pages)
{
    var textFragmentAbsorber = new TextFragmentAbsorber(regSearchText);
    var textSearchOptions = new TextSearchOptions(true);
    textFragmentAbsorber.TextSearchOptions = textSearchOptions;
    page.Accept(textFragmentAbsorber);
    var textFragmentCollection = textFragmentAbsorber.TextFragments;

    foreach (TextFragment textFragment in textFragmentCollection)
    {
        HighlightAnnotation annotation = new HighlightAnnotation(page, textFragment.Rectangle);
        annotation.Color = Color.Yellow;
        annotation.Title = "Team";
        annotation.Contents = "This is test content by Jay";
        // Second argument: take the page rotation into account
        page.Annotations.Add(annotation, true);
    }
}
pdfDocument.Save(_dataDir + "test20.12.out.pdf");

@kranthireddyr

The issue seems to be related to the older version of the API, as we were unable to reproduce it with the latest available version. Would you please try Aspose.PDF for .NET 20.12 and let us know if you still face any issue?

Can you please verify with Aspose.PDF version 17.11.0.0?

@kranthireddyr

As shared earlier, the issue seems to be related to the older version of the API. Please note that version 17.11 of the API is quite old, and many new classes and methods have been added to the API since then. Also, we provide support on the basis of the latest available version. Please try the latest available version, i.e. 20.12, and let us know if the issue still persists.