C# PDF Text Absorber does not extract form fields as part of the text object

vghodekar · March 7, 2025, 1:32am

I have noticed that the TextAbsorber does not extract form fields as part of the text. Is there a way to extract all the text in a document, including the field values, in the positions where they appear when the document is viewed?
For example, the extracted text currently looks like this: “Endorsement \r\nPolicy Number: \r\nEndorsement Effective Date: \r\n” and I need “Endorsement \r\nPolicy Number: P-001-2025 \r\nEndorsement Effective Date: 3/1/2025 \r\n”.
I can get the form fields separately, but I want their values to be part of the text where they are supposed to appear.

Thanks.

Professionalize.Discourse · March 7, 2025, 1:32am

@vghodekar

To extract all text from a PDF document, including the values of form fields in their respective positions, you can utilize the TextAbsorber class in Aspose.PDF for .NET. However, by default, the TextAbsorber does not include form field values in the extracted text.

To achieve your goal, you can follow these steps:

1. Extract Text Using TextAbsorber: First, create an instance of the TextAbsorber class and use it to extract the text from the PDF document.
2. Extract Form Field Values: Next, you will need to access the form fields separately to get their values.
3. Combine Text and Form Field Values: Finally, you can programmatically combine the extracted text with the form field values at the appropriate locations.

Here is a sample code snippet demonstrating how to extract text and form field values:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;
using Aspose.Pdf.Forms;

// Load the PDF document
Document pdfDocument = new Document("input.pdf");

// Create a TextAbsorber instance
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);

// Extract text from the document
string extractedText = textAbsorber.Text;

// Access form fields
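foreach (Field field in pdfDocument.Form.Fields)
{
    // Skip unnamed fields: string.Replace throws if the search string is empty
    if (string.IsNullOrEmpty(field.PartialName))
        continue;

    // Replace the placeholder in the extracted text with the actual field value
    extractedText = extractedText.Replace(field.PartialName, field.Value);
}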

// Output the combined text
Console.WriteLine(extractedText);

In this example, the code first extracts all text from the PDF. Then, it iterates through each form field, replacing placeholders in the extracted text with the actual values of the form fields. Make sure that the placeholders in the text match the names of the form fields for the replacement to work correctly.

This approach allows you to have the form field values included in the text output in the positions they appear in the document.

If you have any further questions or need additional assistance, feel free to ask!

vghodekar · March 7, 2025, 2:10am

I tried the approach above, however, the extracted text does not contain any placeholders as indicated above. Attached is the relevant excerpt from our code. Please note that the form fields do have expected values, however, there is no placeholder in the extracted text that can be replaced as indicated in the response above.

image.png (36.3 KB)

sergei.shibanov · March 7, 2025, 4:03pm

@vghodekar
We are looking into it and will be sharing our feedback with you shortly.

asad.ali · March 7, 2025, 5:41pm

@vghodekar

You can meet this requirement with a workaround: create a copy of the PDF and flatten it using the Document.Flatten() method. Flattening turns all form field values into plain text, which the TextAbsorber class can then extract the same way it extracts the rest of the text. If this does not help, please share your sample PDF with us and we will proceed accordingly.
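
A minimal sketch of this workaround is shown below. It assumes a file named input.pdf, which is a placeholder for your own path. The document is flattened in memory only and is never saved, so the original file on disk should stay unchanged and no separate copy needs to be written.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// "input.pdf" is a placeholder path (assumption); substitute your own file.
using (Document pdfDocument = new Document("input.pdf"))
{
    // Convert form fields into static page content.
    // The change stays in memory because Save() is never called.
    pdfDocument.Flatten();

    // Extract text from all pages, now including the former field values
    TextAbsorber textAbsorber = new TextAbsorber();
    pdfDocument.Pages.Accept(textAbsorber);

    Console.WriteLine(textAbsorber.Text);
}
```

If the flattened version needs to be kept, save it under a different file name so the original form remains editable.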

vghodekar · March 7, 2025, 7:38pm

Thank you!! That worked.