I have a PDF file in which the real content of each page is stored in a Form XObject in that page's resources, like this:
v0.png (15.1 KB)
v2.png (70.0 KB)
With the following code, the SDK can get the Form XObject but cannot get its content:
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(filePath);
Stopwatch sw = new Stopwatch();
var pageCount = pdfDocument.Pages.Count;
result.DocumentPageNumber = pageCount;
var p5 = pdfDocument.Pages[5];
// Take the first Form XObject from the resources of page 5.
var forms = pdfDocument.Pages[5].Resources.Forms;
var form = forms[1];
var ab = new TextAbsorber
{
    // false: the search is not a regular-expression search.
    TextSearchOptions = new Aspose.Pdf.Text.TextSearchOptions(false)
    {
        SearchForTextRelatedGraphics = true,
        // Restrict the search to the form's bounding box.
        Rectangle = form.BBox
    },
    ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure)
};
var bc = new TableAbsorber();
// Visit the form directly instead of the page.
ab.Visit(form);
var c = ab.Text;
The Content property of the text absorber looks correct, but Text is empty:
v4.png (31.1 KB)
v3.png (56.7 KB)
How can I extract text from the Form XObject? Is there anything wrong with the code?
Furthermore, TextAbsorber supports visiting an XForm:
public class TextAbsorber
{
    //
    // Summary:
    //     Extracts text on the specified XForm.
    //
    // Parameters:
    //   form:
    //     Pdf form object.
    public virtual void Visit(XForm form);
}
But TableAbsorber doesn’t have such a Visit overload. It only supports Visit(Page). If there is a table in a Form XObject, how can I extract it?
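For reference, the page-level call that TableAbsorber does support looks roughly like the sketch below. It only illustrates Visit(Page) and how the absorbed tables are read. Whether it picks up a table drawn inside a Form XObject is not confirmed here. The class name PageLevelTableSketch and the use of args[0] in place of filePath are placeholders for illustration.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class PageLevelTableSketch
{
    static void Main(string[] args)
    {
        // args[0] stands in for filePath from the snippet above (assumption).
        Document pdfDocument = new Document(args[0]);
        Page page = pdfDocument.Pages[5];

        // TableAbsorber only exposes a page-level Visit.
        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(page);

        // Walk tables -> rows -> cells -> text fragments.
        foreach (AbsorbedTable table in absorber.TableList)
        {
            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        Console.Write(fragment.Text + "\t");
                    }
                }
                Console.WriteLine();
            }
        }
    }
}
```

If TableList comes back empty for a page whose content lives entirely in a Form XObject, that would confirm the limitation described above.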
Thanks.