TextFragmentAbsorber is unable to find the text in PDF

emeghana · April 26, 2023, 2:36pm

Below is the code we use to find text in the PDF file.
The search text “12345” is just a placeholder.

var textFragmentAbsorber = new TextFragmentAbsorber("12345");
pdfDocument.Pages[80].Accept(textFragmentAbsorber);
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    // process each matching fragment here
}

In some PDFs, TextFragmentAbsorber.TextFragments stays empty (Count is 0) even though the text is present on the page.
We are not searching the entire PDF here, only a specific page, as you can see in the code above.
The PDF size is 36MB.

The same logic works for other, smaller PDFs.

Is there any workaround?

carlos.molina · April 26, 2023, 3:16pm

@emeghana,

That code should work. Is there a way for you to share that PDF with me so I can test that locally?

Because if it fails for me and I have the PDF, I can create a ticket with that information for the dev team.

emeghana · April 26, 2023, 3:30pm

I can’t share the PDF.

emeghana · April 26, 2023, 3:31pm

As I mentioned before, this logic works fine for almost all the small PDFs. We only have the issue with one particular big PDF.

carlos.molina · April 26, 2023, 3:42pm

@emeghana,

Is it that you cannot share it because it is too big, or because it is confidential?

Can you try deleting all pages but 80?

If it is confidential, can you edit or redact all content but the text you want to search for?

Now you will have a document that you can share for testing purposes.

Please let me know. For me to escalate an issue, I need to be able to replicate it on demand so I can provide that information and files to the dev team.

Hopefully, this will make it possible to share the Pdf.

I am not sharing a workaround because there is none for what you are doing. Your code is correct, so for me, this seems like a bug more than an issue with your code at all.

emeghana · April 26, 2023, 5:59pm

Sorry, I can’t give you the file because it contains confidential information, and I can’t reveal the format of the file either, so I can’t attach it here.

But I found something. Please review it and let me know how I can fix this.

PDF has some text in one cell as follows: “Established Organization (11111) is started on 2022.”
Search Text: Established Organization (11111)

For example:
Method 1:
var textFragmentAbsorber = new TextFragmentAbsorber("Established Organization (11111)");
pdfDocument.Pages[80].Accept(textFragmentAbsorber);
This does not return the TextFragment. I want this method to work and return the TextFragment info for the search text on page 80.

Method 2:
The same text can be found split across different segments:
Established
Organization
11111

I can find the text if I search for the individual words with the following code, but this is not what I want.
// tfc: the TextFragmentCollection being searched (its definition is not shown in the post)
foreach (TextFragment textFragment in tfc)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        if (textSegment.Text.Contains("Established") ||
            textSegment.Text.Contains("Organization") ||
            textSegment.Text.Contains("11111"))
        {
            // the text is found here, across the entire PDF, and the inner logic runs
        }
    }
}
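
The observation above, that the phrase is stored as several separate segments, suggests matching against the joined segment text rather than against each segment alone. Below is a minimal sketch of that idea using only the .NET base library. `SegmentSearch` and `FindRun` are hypothetical names, not part of Aspose.PDF, and the input list is assumed to hold the `textSegment.Text` values of one page in reading order. It is not a confirmed fix for the behavior reported in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SegmentSearch
{
    // Removes all whitespace so spacing differences between segments do not matter.
    private static string Squash(string s) =>
        new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray());

    // Returns the inclusive index range of the pieces that together contain the
    // phrase, or null when the phrase does not occur in the joined text.
    public static (int First, int Last)? FindRun(IReadOnlyList<string> pieces, string phrase)
    {
        string needle = Squash(phrase);
        if (needle.Length == 0) return null;

        var joined = new StringBuilder();
        var starts = new List<int>(); // offset of each piece within the joined text
        foreach (string piece in pieces)
        {
            starts.Add(joined.Length);
            joined.Append(Squash(piece));
        }

        int hit = joined.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (hit < 0) return null;
        int end = hit + needle.Length - 1;

        // The owning piece is the last one whose start offset is not past the position.
        int first = starts.FindLastIndex(s => s <= hit);
        int last = starts.FindLastIndex(s => s <= end);
        return (first, last);
    }
}
```

With Aspose.PDF, the returned index range could then be mapped back to the corresponding TextSegment objects to read their positions, assuming the segments were collected into the list in the same order.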

carlos.molina · April 26, 2023, 8:13pm

@emeghana,

I am trying to replicate this. So my PDF needs to have a table with one row and one column, and that row has a cell with the text: “Established Organization (11111) is started on 2022.”

Is this correct?

emeghana · April 27, 2023, 6:24am

Yes, that is correct.

emeghana · April 27, 2023, 12:00pm

I also suspect that, because my document is so big, the line below might be hitting an “out of memory” or timeout issue (note: the code is not throwing anything), so it is not loading any fragments, as shown in the first image.

var textFragmentAbsorber = new TextFragmentAbsorber("Established Organization (11111)");

image.jpg (107.3 KB)

But observe the second image: a single page of the PDF has 758 fragments.
image.png (106.6 KB)

My PDF has 1301 pages and is 10MB.
It might be an “out of memory” or timeout issue.

carlos.molina · April 27, 2023, 2:57pm

@emeghana,

At the moment, I cannot replicate the issue. Do you have any other document where the problem is present, either one without confidential information or one that you can edit to remove it?

The fragment counts can vary, since you can limit it.

emeghana · April 27, 2023, 3:11pm

I wish I could give you the document. In my case it is a 10MB file with 1000+ pages. It’s impossible for me to produce a dummy document of that size.

But the issue appears when the PDF has many pages.
If you have any PDF with 1000+ pages, please test my scenario. It should definitely reproduce.

It’s a work stopper, and we do have an Aspose license.

carlos.molina · April 27, 2023, 4:36pm

@emeghana,

That is why I asked you earlier to delete all the pages except the one with the issue. That gives you a new document with only one page, the one you were searching. Then see if the search works. This way we can figure out whether the problem has to do with the document size or with the page itself.

Can you follow this and let me know how it goes?

Edit: The size of the PDF is not an issue if it is less than 490MB. Clients have sent us PDFs with embedded media. But in order to replicate the issue, we need a document that fails in the same way.
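
The isolation step described in this post could be sketched as follows, assuming the Aspose.PDF for .NET package is referenced. `PageIsolation` and `ExtractPage` are hypothetical names and the file paths are placeholders. Copying the page into a fresh document is used here instead of deleting the other pages; it is one possible approach, not necessarily the one support had in mind.

```csharp
using Aspose.Pdf;

public static class PageIsolation
{
    // Copies one page (1-based index, as in pdfDocument.Pages[80]) of a PDF
    // into a new single-page document and saves it.
    public static void ExtractPage(string sourcePath, int pageNumber, string targetPath)
    {
        using (var source = new Document(sourcePath))
        using (var single = new Document())
        {
            single.Pages.Add(source.Pages[pageNumber]);
            single.Save(targetPath);
        }
    }
}
```

For example, `PageIsolation.ExtractPage("big.pdf", 80, "page80.pdf")` would produce a one-page file on which the same TextFragmentAbsorber search can be repeated, which helps separate a document-size problem from a page-content problem.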

emeghana · April 27, 2023, 6:46pm

@carlos.molina,

I’ve tried the scenario you mentioned. It is still not working when I try with a single-page PDF.

Here is one sample PDF:

var textFragmentAbsorber = new TextFragmentAbsorber("Established Organization (11111)");
pdfDocument1.Pages[1].Accept(textFragmentAbsorber);
image.png (23.6 KB)

Test1.pdf (11.7 KB)

Try with the attached PDF. There is nothing else in it; it is a plain PDF with a single line of text. Even with this, the code is not identifying the fragments.

carlos.molina · April 27, 2023, 8:48pm

@emeghana,

When I use Adobe Acrobat to copy the text from the document, the copied text contains { instead of (.

But regardless of that, even with every filter removed, the TFA (TextFragmentAbsorber) comes back empty. I tried a different PDF and it worked fine, so there must be an issue with the TFA.

So I will open a ticket to the dev team with all this information.
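
The bracket mismatch noted above ({ extracted where ( is displayed) is the kind of difference a regular-expression search can tolerate. Below is a minimal sketch, using only the .NET base library, that turns a literal phrase into such a pattern. `TolerantPattern` and `Build` are hypothetical names. This only addresses the bracket and spacing mismatch; it would not help in the case reported here, where the absorber stays empty even without a search phrase.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TolerantPattern
{
    // Builds a regex that matches the phrase with any run of whitespace between
    // words and with round and curly brackets treated as interchangeable.
    public static string Build(string phrase)
    {
        var words = phrase
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => string.Concat(word.Select(c => EscapeChar(c))));
        return string.Join(@"\s+", words);
    }

    private static string EscapeChar(char c)
    {
        switch (c)
        {
            case '(':
            case '{':
                return "[({]";
            case ')':
            case '}':
                return "[)}]";
            default:
                return Regex.Escape(c.ToString());
        }
    }
}
```

In Aspose.PDF the resulting pattern could be passed as `new TextFragmentAbsorber(pattern, new TextSearchOptions(true))`, where `true` enables regular-expression search.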

carlos.molina · April 27, 2023, 8:52pm

@emeghana
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54486

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.