TextFragmentAbsorber finds the same text twice

cyginfo · May 3, 2022, 12:31pm

Hi,

I am using the code below to extract PDF content:

            TextAbsorber textAbsorber = new TextAbsorber();
            doc.Pages.Accept(textAbsorber);
            string extractedText = textAbsorber.Text;

I can see that MSA audit and PCPA AUDIT each appear only once in the attached SAMPLE_PDF.pdf, but the same text is found multiple times when I use the code below:

    foreach (var tagmodel in ShapesWithSignerTagModel)
    {
        var pageArray = tagmodel.PageNo.Distinct();
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(tagmodel.Tag);

        foreach (int page in pageArray)
        {
            doc.Pages[page].Accept(absorber);
        }

        textFragments.AddRange(absorber.TextFragments.AsEnumerable());
    }

    foreach (TextFragment textFragment in textFragments.OrderBy(a => a.Page.Number))
    {
        textFragment.Text = ""; // MSA audit and PCPA AUDIT are found multiple times
    }

SAMPLE_PDF.pdf (106.5 KB)

I have uploaded the full sample code at the URL below:

asad.ali · May 3, 2022, 8:13pm

@cyginfo

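The reason you are getting duplicate search results is that you are initializing the TextFragmentAbsorber class outside the page loop, so results found on previously searched pages are not cleared from its cache. You need to create the instance inside the loop, as shown below:

    foreach (int page in pageArray)
    {
        // Create a fresh absorber for every page so earlier results are not carried over
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(tagmodel.Tag);
        doc.Pages[page].Accept(absorber);
        // Collect this page's results before the absorber goes out of scope
        textFragments.AddRange(absorber.TextFragments.AsEnumerable());
    }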

Feel free to let us know in case you still notice any issues.

cyginfo · May 4, 2022, 6:13am

Hi,

I have changed the code as shown below but am still facing the same issue (both texts are still found twice):

                foreach (var tagmodel in ShapesWithSignerTagModel)
                {
                    var pageArray = tagmodel.PageNo.Distinct();
                    
                    foreach (int page in pageArray)
                    {
                        TextFragmentAbsorber absorber = new TextFragmentAbsorber(tagmodel.Tag);
                        doc.Pages[page].Accept(absorber);
                        textFragments.AddRange(absorber.TextFragments.AsEnumerable());
                    }                        
                }

                foreach (TextFragment textFragment in textFragments.OrderBy(a => a.Page.Number))
                {
                    textFragment.Text = "";
                }

asad.ali · May 4, 2022, 7:39pm

@cyginfo

We were able to replicate the issue in our environment while testing the scenario with version 22.4 of the API. It has therefore been logged as PDFNET-51732 in our issue management system. We will look into its details and keep you posted on the status of its rectification. Please be patient and allow us some time.

We are sorry for the inconvenience.
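Until the ticket is resolved, one possible stopgap is to drop repeated hits before clearing them. The following is only a minimal sketch and has not been verified against PDFNET-51732. `Hit`, `HitFilter` and `RemoveRepeats` are hypothetical names, and the idea of filling each `Hit` from a fragment's `Page.Number`, `Position.XIndent`, `Position.YIndent` and `Text` is an assumption about how the duplicates could be told apart. Two hits are treated as the same when they share page, text and a position rounded to 0.01 units.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the values read from one search result
// (assumed to come from Page.Number, Position.XIndent, Position.YIndent, Text).
public record Hit(int Page, double X, double Y, string Text);

public static class HitFilter
{
    // Returns the hits in their original order, keeping only the first
    // occurrence of each (page, rounded position, text) combination.
    public static List<Hit> RemoveRepeats(IEnumerable<Hit> hits)
    {
        var seen = new HashSet<(int, long, long, string)>();
        var result = new List<Hit>();
        foreach (var h in hits)
        {
            // Round coordinates so tiny floating-point differences do not
            // make the same location look like two different ones.
            var key = (h.Page,
                       (long)Math.Round(h.X * 100),
                       (long)Math.Round(h.Y * 100),
                       h.Text);
            if (seen.Add(key))
            {
                result.Add(h);
            }
        }
        return result;
    }
}
```

With such a filter in place, only the surviving hits would be mapped back to their fragments before setting `Text = ""`. Whether that is sufficient depends on what actually causes the repeated results, which the ticket has yet to establish.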

cyginfo · May 25, 2022, 9:37am

Hi,

Can we have an update on PDFNET-51732?

asad.ali · May 25, 2022, 1:34pm

@cyginfo

The ticket was logged in our issue management system only recently and has not yet been fully investigated. We will analyze and resolve it on a first come, first served basis, as per our free support policy, and let you know as soon as we make definite progress towards its resolution. Please be patient and allow us some time.

We are sorry for the inconvenience.

jsign · May 25, 2022, 5:01pm

Could you please provide an ETA for PDFNET-51732?

asad.ali · May 25, 2022, 9:26pm

@jsign

As shared earlier, the issue has not yet been investigated, and without a full analysis we are afraid we cannot share a reliable ETA or timeframe for its fix. Your concerns have been recorded, and we will let you know in this forum thread as soon as an update is available. Please allow us some time.

We apologize for the inconvenience.