Unable to extract all text from page

Hi
I’m working on a project where I need to extract data from PDFs in a fairly predictable format, and am currently testing Aspose.PDF.net for this. So far it has worked great, with one exception: some of the pages have metadata text that I need, located out in the margin of the page, and this text is not returned by the TextAbsorber. Unfortunately the PDFs are confidential so I can’t give actual examples.

This text is not selectable with the mouse in Acrobat Reader either (unlike the rest of the text on the page), so at first I assumed it was an image. But the text turns out to be searchable in Acrobat Reader, so it should be possible to extract it.

One theory I had was that they are actually annotations, so I used the PdfAnnotationEditor.ExtractAnnotations() method and found that the page has 4 WatermarkAnnotations, which correspond to the 4 unselectable lines of text in the margin, but the annotations all have Contents == null. Could I be using the annotations incorrectly? (I also tried two other PDF libraries with the same result.)

Any ideas as to what could be the issue here?
Is there maybe some other place (other than page content or annotation content) that text can be located?

The text in question is repeated across several pages. Could it be stored somewhere else in the document than on the page level? I’m not really familiar with the PDF format, so not sure what is possible.

Here is some example code of what I’m doing:

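    using System;
    using System.Linq;
    using Aspose.Pdf;
    using Aspose.Pdf.Annotations;
    using Aspose.Pdf.Facades;
    using Aspose.Pdf.Text;

    public void GetText(string path, int pageNumber)
    {
        var pdf = new Aspose.Pdf.Document(path);
        var page = pdf.Pages[pageNumber];

        // Extract the text found in the page content
        var ab = new TextAbsorber();
        ab.Visit(page);

        var pageText = ab.Text;

        // Extract the annotations of the relevant types from the same page
        var ae = new PdfAnnotationEditor();
        ae.BindPdf(pdf);
        var annotations = ae.ExtractAnnotations(pageNumber, pageNumber, new[] { AnnotationType.FreeText, AnnotationType.Text, AnnotationType.Watermark });

        // One line per annotation: its type and its Contents value
        var at = annotations.Select(a => $"{a.AnnotationType}: '{a.Contents}'");
        var annotationText = string.Join(Environment.NewLine, at);
    }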

Running this method results in the pageText variable containing the “normal” text but not the 4 lines in the margin.
The annotationText variable has the following value:

Watermark: ''
Watermark: ''
Watermark: ''
Watermark: ''
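
One possibility worth checking, offered as an assumption rather than a confirmed fix: in the PDF format an annotation's visible content can be drawn by its appearance stream (a form XObject) instead of being stored in Contents, which would explain text that is searchable but comes back as null here. The sketch below runs a TextAbsorber over the normal ("N") appearance of each watermark annotation. The class and method names are hypothetical, and the Aspose calls (Annotation.Appearance, its ContainsKey and indexer, and TextAbsorber.Visit(XForm)) are written from memory, so they should be checked against the API reference for the installed version.

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

public static class WatermarkTextSketch
{
    // Hypothetical helper (not part of Aspose.PDF): collects whatever text is
    // drawn by the appearance streams of the watermark annotations on a page.
    public static List<string> GetWatermarkAppearanceText(string path, int pageNumber)
    {
        var result = new List<string>();

        using (var pdf = new Document(path))
        {
            Page page = pdf.Pages[pageNumber];

            foreach (Annotation annotation in page.Annotations)
            {
                if (annotation.AnnotationType != AnnotationType.Watermark)
                    continue;

                // Assumption: the margin text is drawn by the normal ("N")
                // appearance stream rather than stored in Contents.
                var appearance = annotation.Appearance;
                if (appearance == null || !appearance.ContainsKey("N"))
                    continue;

                XForm form = appearance["N"];

                // Run text extraction over the form XObject instead of the page.
                var absorber = new TextAbsorber();
                absorber.Visit(form);
                result.Add(absorber.Text);
            }
        }

        return result;
    }
}
```

If the returned strings are empty as well, the text is probably stored somewhere other than the annotation appearances, and this approach does not apply.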

Hi
I am also working on a PDF data extraction project and we were planning to use Aspose for this, but I encountered the same problem as above when testing with a trial license. The content for annotations is always null.

Is there any progress on this issue? Or is there a workaround/another way to get the text from annotations?

Update: I tried extracting the annotations with a couple of other pdf tools, and they also gave empty annotation contents. So maybe my problem is not related to this issue.

@rbwbstp

Would you kindly share your sample PDF document with us as well, so that we can test the scenario in our environment and address it accordingly?

Hi
Unfortunately the PDFs are confidential so I can’t give samples.

@rbwbstp

We are afraid that we cannot share our feedback without testing the scenario in our environment. In order to investigate and determine the cause of the issue, we need a sample PDF document. In case you cannot share it publicly, you can share it in a private message as shown in the attached image. We assure you that we do not disclose or share your files with anyone, and we remove them from our system once the issue is investigated.

I understand. I will check to see if I can find some part of a document which has this issue that can be shared. But these documents come from a client of my client, so I can’t make that decision myself.

@rbwbstp

Sure, please take your time to gather the sample PDF and share it with us so that we can proceed to assist you accordingly.

I have now sent you a sample PDF in a private message as you suggested, containing just one of the pages where I am unable to extract all the text. The TextAbsorber only returns two lines of text (and some whitespace) for this document (the text in the center of the page), and not the four lines of text out in the margins.

The PdfAnnotationEditor returns 4 watermark annotations, but they all have Contents == null.

@rbwbstp

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54562

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.