Search for text spanning two paragraphs in Aspose.PDF

ansaridurai · July 18, 2024, 10:59am

I am trying to search content using TextFragmentAbsorber. I get text fragments when searching for a single paragraph or a single word, but searching for content that spans multiple paragraphs does not work. Please provide a workaround. See the attached image for the content I need to search for and add a highlight annotation to.
Capture.PNG (145.8 KB)

asad.ali · July 18, 2024, 7:39pm

@ansaridurai

Would you please share your sample PDF with us as well? We will test the scenario in our environment and address it accordingly.

ansaridurai · July 19, 2024, 5:48am

Sample.pdf (91.0 KB)

Searching for the whole text in that file returns an empty text fragment collection.
Please refer to the sample PDF file.

asad.ali · July 19, 2024, 3:29pm

@ansaridurai

Can you please also share the code sample that you are using to extract the text? We will test the scenario in our environment and address it accordingly.

ansaridurai · July 22, 2024, 6:06am

The sample code below is used to search for the text. If the text content spans two or more paragraphs, the list of text fragments is empty.

var tfa = new TextFragmentAbsorber(new Regex(textContent.Trim()
    .Replace(" ", @"\s*").Replace("(", @"\(").Replace(")", @"\)")
    .Replace(".", @"\.").Replace(":", @"\:").Replace("-", @"\-")
    .Replace(' ', ' ').Replace("", "")), new TextSearchOptions(true));
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
tfa.TextSearchOptions = textSearchOptions;
doc.Pages.Accept(tfa);

asad.ali · July 22, 2024, 6:35pm

@ansaridurai

We are assuming that you are copying all text from the PDF and assigning it to the textContent variable. Right?

ansaridurai · July 23, 2024, 5:56am

Yes, I get the text content from HTML; the HTML is converted to PDF first.

i.e. string textContent = element.TextContent;

I then search for that text content in the PDF. It works for a single paragraph or a single word, but content spanning multiple paragraphs is not found.

asad.ali · July 23, 2024, 11:47am

@ansaridurai

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57723

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

asad.ali · November 19, 2024, 9:45pm

@ansaridurai

We have looked into this problem but could not find a fault. Please check the regular expression you use to search the text. Note that you can extract the entire text of the document and test your regular expression against it directly. As the following code snippet shows, TextFragmentAbsorber returns all the text from the page, but your regular expression produces no results in it.

var input = GetInputPath("57723.pdf");
string textContent = @"The Directors recognise that the strength of our business is built on the hard work, loyalty, dedication
and abilities of all of our people. The success of our business depends on attracting, retaining, and
motivating employees. Ensuring that we remain a responsible employer, from pay and benefits to our health, safety and
workplace environment, the Directors' factor the implications of decisions on employees and the wider
workforce where relevant and feasible.";
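Document pdfDocument = new Document(input);
// With no search phrase, the absorber collects all the text on the pages.
var tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
// Build the same regular expression as in the original post.
var rx = new Regex(textContent.Trim().Replace(" ", @"\s*").Replace("(", @"\(").Replace(")", @"\)")
    .Replace(".", @"\.").Replace(":", @"\:").Replace("-", @"\-").Replace(' ', ' ').Replace("", ""));
// Test it against the extracted page text; no results are found.
var matches = rx.Matches(tfa.Text).Count;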
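
One possible reason the expression finds nothing for multi-paragraph content is that only plain spaces are turned into \s*, so every line break in textContent has to match the page text character for character. The sketch below is an assumption about the cause, not a confirmed diagnosis of PDFNET-57723. FlexibleTextPattern, Build and BuildSpacesOnly are hypothetical helper names, not part of Aspose.PDF; the code uses only System.Text.RegularExpressions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF.
public static class FlexibleTextPattern
{
    // Builds a regex that matches the words of `text` in order,
    // allowing any amount of whitespace (spaces, tabs, line breaks,
    // non-breaking spaces) between them.
    public static Regex Build(string text)
    {
        // Split on every whitespace run, not only on single spaces.
        string[] words = Regex.Split(text.Trim(), @"\s+");
        // Regex.Escape takes care of metacharacters such as ( ) . [ ] * + ?
        string pattern = string.Join(@"\s*", words.Select(Regex.Escape));
        return new Regex(pattern);
    }

    // Mirrors the approach in the thread for comparison:
    // only plain spaces become \s*, line breaks stay literal.
    public static Regex BuildSpacesOnly(string text)
    {
        return new Regex(Regex.Escape(text.Trim()).Replace(@"\ ", @"\s*"));
    }
}
```

Assuming this is the cause, the result of FlexibleTextPattern.Build(textContent) could be checked against tfa.Text as in the snippet above, or passed to the TextFragmentAbsorber constructor together with new TextSearchOptions(true) as in the original post. Whether the absorber then returns a fragment that spans several paragraphs is not confirmed anywhere in this thread.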