Cannot extract text from Form XObject

WPInfo · December 21, 2021, 3:26am

In this PDF file, the real content of each page lives in a Form XObject in that page's resources, like this:

v0.png (15.1 KB)
v2.png (70.0 KB)

With the following code, the SDK can get the Form XObject but cannot extract its content:

// filePath and result are defined elsewhere in the calling code.
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filePath);
var pageCount = pdfDocument.Pages.Count;
result.DocumentPageNumber = pageCount;

// Take the Form XObject at index 1 from the resources of page 5.
var p5 = pdfDocument.Pages[5];
var forms = p5.Resources.Forms;
var form = forms[1];
var ab = new TextAbsorber
{
    TextSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(false)
    {
        SearchForTextRelatedGraphics = true,
        Rectangle = form.BBox
    },
    ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure)
};

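// Created for table extraction, but unused: TableAbsorber has no Visit(XForm) overload.
var bc = new TableAbsorber();

// Visit the form directly; ab.Text comes back empty here.
ab.Visit(form);
var c = ab.Text;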

The Content property of the text absorber looks correct, but Text is empty:

v4.png (31.1 KB)
v3.png (56.7 KB)

How can I extract text from the Form XObject? Is there anything wrong with the code?

Furthermore, TextAbsorber supports visiting an XForm:

public class TextAbsorber
{
    //
    // Summary:
    //     Extracts text on the specified XForm.
    //
    // Parameters:
    //   form:
    //     Pdf form object.
    public virtual void Visit(XForm form);
}

But TableAbsorber doesn’t have such a Visit overload. It only supports Visit(Page). If there is a table in a Form XObject, how can I extract it?
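
For reference, the page-level path that TableAbsorber does support looks roughly like the sketch below. It is a minimal illustration, not a confirmed solution: the class name TableFromPageSketch and the command-line input path are placeholders, and whether Visit(Page) picks up a table that is drawn inside a Form XObject is exactly the open question here.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class TableFromPageSketch
{
    static void Main(string[] args)
    {
        // Assumption: the input PDF path is passed as the first argument.
        var pdfDocument = new Document(args[0]);

        // Same page as in the snippet above.
        Page page = pdfDocument.Pages[5];

        // TableAbsorber only accepts a whole page, not an XForm.
        var tableAbsorber = new TableAbsorber();
        tableAbsorber.Visit(page);

        // Print each detected table, one row per line, cells separated by tabs.
        foreach (AbsorbedTable table in tableAbsorber.TableList)
        {
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        Console.Write(fragment.Text);
                    }
                    Console.Write("\t");
                }
                Console.WriteLine();
            }
        }
    }
}
```

If TableList comes back empty for a page whose table sits in a Form XObject, that would suggest the page-level visit does not descend into the form's content.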

Thanks.

tahir.manzoor · December 21, 2021, 2:36pm

@WPInfo

Could you please attach your input PDF file here for testing? We will investigate the issue and provide you with more information on it.