Search text from PDF file using C# and Aspose.PDF for .NET | returns 0 in TextFragmentCollection

KrishnaPMI · June 10, 2021, 5:03pm

TextFragmentAbsorber textFragmentAbs = new TextFragmentAbsorber("2.1.1.1 Identify");
textFragmentAbs.ExtractionOptions.FormattingMode = TextExtractionOptions.TextFormattingMode.Pure;
pdfDocument.Pages.Accept(textFragmentAbs);
TextFragmentCollection textFragmentCol = textFragmentAbs.TextFragments;

This is the code, written in .NET Core (C#), that I am using to get the text fragments matching a string, but the TextFragmentCollection comes back with 0 items.

Can anyone suggest a solution for this? It would be a great help.

I have attached a sample file.
TestPage.pdf (73.3 KB)

asad.ali · June 10, 2021, 9:19pm

@KrishnaPMI

Sometimes Adobe Reader stores the space between words as a different character, and in order to match it you need to use a regular expression. Please use code like the following to find the target keyword in the PDF:

Document doc = new Document(dataDir + "TestPage.pdf");
TextFragmentAbsorber textFragmentAbs = new TextFragmentAbsorber(@"2.1.1.1+\sIdentify", new TextSearchOptions(true));
textFragmentAbs.ExtractionOptions.FormattingMode = TextExtractionOptions.TextFormattingMode.Pure;
doc.Pages.Accept(textFragmentAbs);
TextFragmentCollection textFragmentCol = textFragmentAbs.TextFragments;
Console.WriteLine(textFragmentCol.Count);
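
Note that in the pattern above the dots are unescaped, so each one matches any character, and the + applies to the final 1 rather than to \s. A stricter variant, which also extends to titles with several words, escapes each word and joins the words with \s+. The following is a minimal sketch, assuming a hypothetical helper named TitlePattern.FromTitle (it is not part of Aspose.PDF). Whether the resulting pattern matches in a given file still depends on how the text is stored in the PDF.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TitlePattern
{
    // Hypothetical helper (not an Aspose.PDF API): turns a literal title into
    // a regex that accepts any run of whitespace between its words.
    public static string FromTitle(string title)
    {
        // Split the title on whitespace.
        string[] words = Regex.Split(title.Trim(), @"\s+");

        // Escape regex metacharacters in each word (so "." matches only a dot),
        // then join the words with \s+.
        return string.Join(@"\s+", words.Select(Regex.Escape));
    }

    public static void Main()
    {
        string pattern = FromTitle("2.1.1.2 Understand and Analyze");
        Console.WriteLine(pattern);
        // prints: 2\.1\.1\.2\s+Understand\s+and\s+Analyze

        // In .NET, \s also matches a non-breaking space (U+00A0).
        Console.WriteLine(Regex.IsMatch("2.1.1.2\u00A0Understand  and Analyze", pattern));
    }
}
```

The returned string could then be passed to new TextFragmentAbsorber(pattern, new TextSearchOptions(true)), as in the snippet above.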

KrishnaPMI · June 11, 2021, 11:54am

Thank you @asad.ali
The solution you suggested worked for me.

My one concern is that I have a PDF book of 400 pages, and my requirement is to get the position (page number and coordinates) of every bookmark/title in the book. The solution you suggested does not work for all of them. As you may have noticed, the page I shared has one more title, “2.1.1.2 Understand and Analyze”, which has multiple words, and when I replace the spaces between the words with “+\s” it does not match. It would be a great help if you could suggest a regular expression that resolves this.
Alternatively, what should I take care of while creating the PDF so that the space between words is not stored as, or replaced with, another character?
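
For the position requirement, each matched TextFragment carries the page it sits on and its bounding rectangle. The following is a rough sketch, assuming the TextFragment.Page, Page.Number and TextFragment.Rectangle (LLX, LLY, URX, URY) members of Aspose.PDF for .NET. The file path and the pattern 2\.1\.1\.1\s+Identify are illustrative assumptions, not a confirmed fix for the multi-word case.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TitleLocator
{
    public static void Main()
    {
        // Assumed path: the sample file attached to this thread.
        Document doc = new Document("TestPage.pdf");

        // Regex search; TextSearchOptions(true) enables regular expressions.
        var absorber = new TextFragmentAbsorber(
            @"2\.1\.1\.1\s+Identify", new TextSearchOptions(true));
        doc.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Bounding box of the match in page coordinates.
            Rectangle box = fragment.Rectangle;
            Console.WriteLine(
                "{0}: page {1}, LLX={2}, LLY={3}, URX={4}, URY={5}",
                fragment.Text, fragment.Page.Number,
                box.LLX, box.LLY, box.URX, box.URY);
        }
    }
}
```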

asad.ali · June 14, 2021, 5:07pm

@KrishnaPMI

We noticed similar behavior while testing the scenario in our environment, so we have logged an issue as PDFNET-50079 in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please be patient and give us some time.

We are sorry for the inconvenience.

KrishnaPMI · June 17, 2021, 7:39am

@asad.ali

Thank you for your response. Please let me know if you and your team find a solution, as this issue is preventing me from fulfilling my requirement.

asad.ali · June 17, 2021, 3:44pm

@KrishnaPMI

We will investigate and resolve the ticket on a first-come, first-served basis and will inform you as soon as it is fixed. Please give us some time.

We apologize for the inconvenience caused.