Form fields not extracting in Aspose.PDF

GowriRB · April 22, 2025, 5:10am

I am running into two different issues and am looking for expert advice.

I have the latest Aspose.Total for .NET license, which I also use for Aspose.PDF features. The system is not able to extract the state of form fields: every PDF reports a form field count of 0, so I am unable to extract the values of checkboxes, radio buttons, etc.


When I use the OCR engine, a radio button is detected as the letter "o" and not as a radio button.

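// Apply the Aspose.Total license to Aspose.OCR
Aspose.OCR.License pdflicense = new Aspose.OCR.License();
string pdflicensePath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Aspose.Total.NET.lic");
pdflicense.SetLicense(pdflicensePath);

// ocrEngine and imageDirectory are defined elsewhere (not shown in this post)
foreach (string imagePath in Directory.GetFiles(imageDirectory, "*.png"))
{
    try
    {
        // Create OcrInput object and add the image
        OcrInput input = new OcrInput(InputType.SingleImage);
        input.Add(imagePath);

        // Perform OCR on the image
        OcrOutput result = ocrEngine.Recognize(input, new RecognitionSettings());
    }
    catch (Exception ex)
    {
        // Report the failing image and continue with the next one
        Console.WriteLine($"OCR failed for {imagePath}: {ex.Message}");
    }
}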

Please advise how to extract radio button and checkbox values from a PDF.

GowriRB · April 22, 2025, 5:13am

I also get an argument-out-of-range exception with the code below. Both PDFs have 64 pages.
SideBySideComparisonOptions options_SD = new SideBySideComparisonOptions
{
    // Set your comparison options here
    AdditionalChangeMarks = true,
    ExcludeTables = false
};

SideBySidePdfComparer.Compare(pdfDocument1, pdfDocument2, "ComparisonResult_rb.pdf", options_SD);

asad.ali · April 22, 2025, 11:34am

@GowriRB

Have you checked the PDF file by opening it in Adobe Reader? Does it really have form fields? Can you please share it with us so that we can test the scenario in our environment and address it accordingly?

GowriRB · April 22, 2025, 6:41pm

Hello @asad.ali ,
Thanks for your response. Yes, the pdf renders without issues. Let me share the first page here for you to review.

GowriRB · April 22, 2025, 6:46pm

Initial Disclosure _SC123_new.pdf (1.2 MB)

Initial Disclosure_SC_base_Page1.pdf (1.3 MB)

asad.ali · April 22, 2025, 8:00pm

@GowriRB

We have checked these PDF files. They do not contain AcroForm fields when opened in Adobe Reader. There are no interactive features: you cannot enter or delete values in the file as you can in a real form field (e.g. a text box). Therefore, a form field count of 0 is the expected behavior of the API. Please feel free to let us know if you have more concerns.
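For reference, here is a minimal sketch of how one might check whether a PDF contains real AcroForm fields and read checkbox and radio button states with Aspose.PDF for .NET. The file name `input.pdf` and the class name `FormFieldCheck` are placeholders, not taken from this thread. For a flattened document like the ones attached above, the count would be 0 and the loop would not run.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Forms;

class FormFieldCheck
{
    static void Main(string[] args)
    {
        // Placeholder path (assumption); pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        using (Document doc = new Document(path))
        {
            // A count of 0 means the PDF has no interactive (AcroForm) fields.
            Console.WriteLine($"Form field count: {doc.Form.Count}");

            foreach (Field field in doc.Form.Fields)
            {
                switch (field)
                {
                    case CheckboxField checkbox:
                        Console.WriteLine($"{checkbox.FullName}: checked = {checkbox.Checked}");
                        break;
                    case RadioButtonField radio:
                        // Selected is the index of the chosen option; Value is its export value.
                        Console.WriteLine($"{radio.FullName}: selected = {radio.Selected}, value = {radio.Value}");
                        break;
                    default:
                        Console.WriteLine($"{field.FullName}: value = {field.Value}");
                        break;
                }
            }
        }
    }
}
```

If this prints a count of 0 for a document that visually shows checkboxes, the fields have most likely been flattened into page content and cannot be read this way.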

GowriRB · April 23, 2025, 3:21am

Thanks for this. Could you please suggest if Aspose.OMR would work on acroforms or what technique should instead be used?

asad.ali · April 23, 2025, 11:41am

@GowriRB

Aspose.OMR can only be used for bubble sheets, and it requires creating OMR templates. Additionally, it works on images rather than PDFs. Nevertheless, you can extract text from such PDFs (as they are plain text) using Aspose.PDF for .NET. Please check the article below for more information:

Search and Get Text from Pages of PDF|Aspose.PDF for .NET

GowriRB · April 25, 2025, 1:32am

Hello @asad.ali ,
Thanks for this. I did try extracting the PDF text using TextAbsorber, but for the sample that I shared the extracted text is always empty.

Could you also run it once to check this, please?

Aspose.Pdf.Document pdfDocumentTest = new Aspose.Pdf.Document(@"f1040_EditVersionRB_Aspose.pdf");

// One buffer per extraction mode (the last three were declared elsewhere in the original code)
StringBuilder teststr = new StringBuilder();
StringBuilder extractedText_asposeabsorberPure = new StringBuilder();
StringBuilder extractedText_asposeabsorberRaw = new StringBuilder();
StringBuilder extractedText_asposeabsorberFlatten = new StringBuilder();

foreach (Page page in pdfDocumentTest.Pages)
{
    // Default extraction options
    TextAbsorber textAbsorber = new TextAbsorber();
    page.Accept(textAbsorber);
    teststr.Append(textAbsorber.Text);

    // Pure formatting mode
    TextAbsorber textAbsorber_Pure = new TextAbsorber
    {
        ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure),
        TextSearchOptions = new TextSearchOptions(true)
    };
    page.Accept(textAbsorber_Pure);
    extractedText_asposeabsorberPure.Append(textAbsorber_Pure.Text);

    // Raw formatting mode
    TextAbsorber textAbsorber_Raw = new TextAbsorber
    {
        ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw),
        TextSearchOptions = new TextSearchOptions(true)
    };
    page.Accept(textAbsorber_Raw);
    extractedText_asposeabsorberRaw.Append(textAbsorber_Raw.Text);

    // Flatten formatting mode
    TextAbsorber textAbsorber_Flatten = new TextAbsorber
    {
        ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Flatten),
        TextSearchOptions = new TextSearchOptions(true)
    };
    page.Accept(textAbsorber_Flatten);
    extractedText_asposeabsorberFlatten.Append(textAbsorber_Flatten.Text);
}

Moreover, even when it works for some PDFs, it still extracts the text without the checkbox or radio button values.

GowriRB · April 25, 2025, 2:55am

Here is some sample output from TextAbsorber, where the field values are missing. Could you please review and advise on this? I am attempting a PDF comparison, and it is critical that it takes into account all the information captured in the PDF.

RawForm.png (83.1 KB)

PureForm.png (66.6 KB)

FlattenForm.png (197.2 KB)

asad.ali · April 25, 2025, 12:49pm

@GowriRB

Thanks for providing more details. We have checked the information you shared. As per our understanding, you need to extract the PDF content in such a way that checkboxes (shapes) are also extracted with their respective labels/options, and you need to compare the entire content with another PDF. Please confirm if we understood your scenario correctly.

You can share some more screenshots, maybe of your expected outputs if we have missed something.

GowriRB · April 25, 2025, 2:09pm

Yes, I am attempting to compare two PDF files. I shared the PDFs earlier. The main issue is that TextAbsorber doesn't extract values such as the selected radio button. Suppose the source PDF has US Citizen selected as citizenship and the modified PDF has Non-Permanent Alien selected. This is a mismatch, so based on Aspose's output I should report the two PDFs as non-identical. But since Aspose is not able to extract the citizenship information, both PDFs would be reported as identical, which is incorrect.
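To make the expected behavior concrete, here is a small sketch of the mismatch check described above. It assumes the selected values have already been obtained somehow as field-name-to-value maps, which is exactly the part that is not working for these documents. The field name `Citizenship`, the class name `FieldDiff`, and the `Diff` helper are all hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

class FieldDiff
{
    // Returns a description of every field whose value differs between the two maps.
    static List<string> Diff(IDictionary<string, string> source, IDictionary<string, string> modified)
    {
        var mismatches = new List<string>();

        foreach (var pair in source)
        {
            if (!modified.TryGetValue(pair.Key, out string other))
                mismatches.Add($"{pair.Key}: missing in modified");
            else if (!string.Equals(pair.Value, other, StringComparison.Ordinal))
                mismatches.Add($"{pair.Key}: '{pair.Value}' vs '{other}'");
        }

        foreach (string key in modified.Keys)
        {
            if (!source.ContainsKey(key))
                mismatches.Add($"{key}: missing in source");
        }

        return mismatches;
    }

    static void Main()
    {
        // Hypothetical values mirroring the citizenship example.
        var source = new Dictionary<string, string> { ["Citizenship"] = "US Citizen" };
        var modified = new Dictionary<string, string> { ["Citizenship"] = "Non-Permanent Alien" };

        foreach (string mismatch in Diff(source, modified))
            Console.WriteLine(mismatch);
        // prints: Citizenship: 'US Citizen' vs 'Non-Permanent Alien'
    }
}
```

Any non-empty result would mean the two PDFs should be reported as non-identical.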

asad.ali · April 25, 2025, 10:12pm

@GowriRB

Thanks for the explanation. Please note that the TextAbsorber class of Aspose.PDF extracts only text from a PDF file. Your PDF document does not have any form fields whose values could be extracted and compared. It is a flattened PDF document that contains only text, with the form fields present as drawn shapes or objects. It is highly unlikely that the API could detect which checkbox is selected for a given label, because there is no actual checkbox there.

Even if you try to extract the drawn shapes (flattened form fields), they will be extracted as images and, sadly, Aspose.PDF won't be able to recognize or read those images to determine whether they are checked. Furthermore, it wouldn't be possible to associate those images with their corresponding text inside the PDF document. We are afraid that this requirement falls under the limitations of the API, as well as of how Adobe Reader reads the PDF document.

If you are able to achieve this with Adobe Reader or any other utility, please share that with us so that we can try to analyze and check if it can be implemented in the API or not.