TextFragmentAbsorber returning symbols instead of text

mtevebaugh · November 3, 2023, 8:51pm

I have attached two PDFs from a project: one that works correctly and one that does not. I stripped out any identifying information.
Broken.pdf (117.9 KB)
Works.pdf (229.6 KB)

Both PDFs are from the same project. When I run a text search on the files using TextFragmentAbsorber, I get different results. Works.pdf returns all of the file's text as expected: I can read all of the text, or search within a specific coordinate rectangle, and get the expected results. My current function searches for the drawing number in the bottom right, which for Works.pdf is “29-A-SECELE-01”. I can find that value using a rect of (2176.243200, 66.110400, 2395.159200, 103.240800).
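A minimal sketch of a rectangle-restricted search like the one described above, assuming the `TextSearchOptions.LimitToPageBounds` and `TextSearchOptions.Rectangle` properties of Aspose.PDF for .NET. The class name, method name, and parameters are hypothetical and only illustrate the approach; this is not necessarily the exact code used in the project.

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class RectangleSearchSketch
{
    // Hypothetical helper: collects the text of every fragment found inside
    // the given rectangle on each page of the document.
    public static List<string> FindTextInRectangle(
        string filePath, double llx, double lly, double urx, double ury)
    {
        var results = new List<string>();

        using (Document pdfDocument = new Document(filePath))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // A fresh absorber per page, so no fragments carry over.
                var absorber = new TextFragmentAbsorber();

                // Restrict the search to the rectangle, given as
                // lower-left x/y and upper-right x/y.
                absorber.TextSearchOptions.LimitToPageBounds = true;
                absorber.TextSearchOptions.Rectangle =
                    new Aspose.Pdf.Rectangle(llx, lly, urx, ury);

                page.Accept(absorber);

                foreach (TextFragment fragment in absorber.TextFragments)
                {
                    results.Add(fragment.Text);
                }
            }
        }

        return results;
    }
}
```

Called as `RectangleSearchSketch.FindTextInRectangle(path, 2176.243200, 66.110400, 2395.159200, 103.240800)`, this corresponds to the rect search on the drawing-number area.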

If I run that same rect search on Broken.pdf, I get a garbage result: “63/1$”. The symbols don’t come through here, but I will post a screenshot.
SymbolCharacters.PNG (1.0 KB)
I have also tried exporting all of the TextFragmentAbsorber text on the page, and that yields a similar result. A few pieces of text come through in the middle, but most of it is garbage.
Here is a partial view of the text export of Broken.pdf that shows the text “Verify Scales” surrounded by other symbols:
VerifyScales.PNG (5.8 KB)

I have confirmed that the text in Broken.pdf is selectable and searchable in a PDF viewer. The only difference I have found between the two files is that, in the broken file, page.Rotate == Rotation.on90. I tried rotating the page before reading the text, both in an external PDF editor and in Aspose, with the same result.

I am using Aspose.PDF for .NET 23.10.0.0 on .NET Framework 4.8.1.

Here is the code I am using to export all text:

InitializeLicense();

// Open the document
using (Document pdfDocument = new Document(_filePath))
{
    string returnVal = string.Empty;
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();

    // Loop through all the pages
    foreach (Page page in pdfDocument.Pages)
    {
        page.Accept(absorber);

        // Log the page number and the text of each fragment
        foreach (var fragment in absorber.TextFragments)
        {
            _logger.Log($"Page: {page.Number} \"{fragment.Text}\"");
        }

        absorber.Reset();
    }
}

Is there something else I need to do to account for the page rotation? Or is there another reason the TextFragmentAbsorber is not finding all the text in the file?

asad.ali · November 3, 2023, 10:13pm

@mtevebaugh

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-55827

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.