Unable to get checkbox value from pdf

apurva.shah · August 14, 2017, 10:32am

I am trying to extract a checkbox value from a non-fillable vector PDF. I can read the checkbox value when Acrobat Reader was used to select the checkbox (x). However, if someone fills in a fillable PDF in the Chrome browser, including selecting a checkbox, and saves it as a PDF, the checkbox is converted into a shape and I cannot find any property to read its value.

imran.rafique · August 14, 2017, 3:57pm

@apurva.shah,
Can you retrieve the checkbox value when the PDF document is filled in with any other browser? In other words, is this a browser-specific problem? Kindly share the complete details of the use case, including the source PDF, the code snippet you are using to fill in the data, and the final PDF from which you are unable to retrieve the checkbox value. We will investigate and share our findings with you. Your response is awaited.

Best Regards,
Imran Rafique

apurva.shah · August 16, 2017, 4:11am

Hi Imran

Thank you very much for the quick reply. I am attaching one fillable PDF (TestFillableform_01.pdf) and one flattened PDF (TestNonFillableFlattened_01.pdf). From the fillable PDF, I am able to read the value using the page control's value. We have recorded the coordinates (x, y) of each field on the form. When we receive a non-fillable form, we try to get the value using a text absorber with a rectangle, following the code given at "Extract Text from PDF using C# | Aspose.PDF for .NET".

On the fillable form, we can easily read the checkbox on/off state using the control's value. But when someone flattens the PDF using Aspose.PDF's document.Flatten() command or Acrobat Pro, or opens it in the Chrome browser, fills in the values, and saves it as a PDF, the checkbox state is saved as a shape (I believe), and I could not find any method to read it. The text absorber class just gives us \r\n as the value at the checkbox coordinates.

TestFillableform_01.pdf (155.6 KB)
TestNonFillableFlattened_01.pdf (173.8 KB)

imran.rafique · August 16, 2017, 7:34am

@apurva.shah,
Thank you for the details. You can record the rectangle region of the checkbox, convert that region to an image, and then use the Aspose.OCR for .NET API to detect text characters. In your scenario, if the checkbox is checked, the Aspose.OCR for .NET API will detect the “X” character. Please try the following code:

[C#]

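// Namespaces used by this snippet (requires the Aspose.PDF and Aspose.OCR assemblies)
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Forms;

// load source PDF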
Document document = new Document(@"C:\Pdf\test223\TestFillableform_01.pdf");
// Get a field by name
CheckboxField checkBoxField = document.Form["F[0].P1[0].NamedInsured_LegalEntity_CorporationIndicator_A[0]"] as CheckboxField;

Console.WriteLine(checkBoxField.Value);
// record the rectangle region
Aspose.Pdf.Rectangle rect = checkBoxField.Rect;

// load flattened PDF document
Document temp = new Document(@"C:\Pdf\test223\TestNonFillableFlattened_01.pdf");

// Get rectangle of particular page region
Aspose.Pdf.Rectangle pageRect = rect;
// Set CropBox value as per rectangle of desired page region
temp.Pages[1].CropBox = pageRect;
// Save cropped document into stream
MemoryStream ms = new MemoryStream();
temp.Save(ms);

// Open cropped PDF document and convert to image
temp = new Document(ms);
// Create Resolution object
Resolution resolution = new Resolution(300);
// Create PNG device with specified attributes
PngDevice pngDevice = new PngDevice(resolution);
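MemoryStream imgStream = new MemoryStream();
// Convert a particular page and save the image to stream
pngDevice.Process(temp.Pages[1], imgStream);
// Rewind the image stream so it is read from the beginning in the OCR step
imgStream.Position = 0;
// The cropped PDF stream is no longer needed once the image has been rendered
ms.Close();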

// Initialize an instance of OcrEngine
Aspose.OCR.OcrEngine ocrEngine = new Aspose.OCR.OcrEngine();

//Set Image property of OcrEngine to the stream obtained from previous step
ocrEngine.Image = Aspose.OCR.ImageStream.FromStream(imgStream, Aspose.OCR.ImageStreamFormat.Png);

//Perform OCR operation on one page at a time
if (ocrEngine.Process())
{
    Console.WriteLine(ocrEngine.Text);
}
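
The recognized text still has to be turned into an on/off state. The following is only a minimal sketch of that last step, assuming a checked box is recognized as “X” (as described above) and an unchecked box yields empty or whitespace-only text. `CheckboxOcr` and `IsChecked` are hypothetical names for illustration and are not part of any Aspose API.

```csharp
using System;

// Hypothetical helper, not part of Aspose.OCR: interprets the text
// recognized in the cropped checkbox region.
public static class CheckboxOcr
{
    // Returns true when the recognized text contains an "X" mark.
    // Assumption: empty or whitespace-only text (for example "\r\n")
    // means the box is unchecked.
    public static bool IsChecked(string ocrText)
    {
        if (string.IsNullOrWhiteSpace(ocrText))
        {
            return false;
        }

        // Case-insensitive match, in case the mark is recognized as a lowercase "x".
        return ocrText.IndexOf("X", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Inside the `if (ocrEngine.Process())` block it could be called as `bool isChecked = CheckboxOcr.IsChecked(ocrEngine.Text.ToString());`. Recognition quality on small regions may vary, so the matching rule may need adjusting for your forms.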

Kindly let us know in case of any confusion or questions.

Best Regards,
Imran Rafique