Detecting corrupt PDFs

I searched the forums, and the latest reply about corrupt PDF detection was back in 2015.
I’m using Aspose.PDF 18.2 for .NET 4.0 to parse arbitrary PDFs, and I’ve noticed that it loads most malformed / corrupt PDFs I’ve encountered or created (by overwriting bytes in a hex editor, etc.) without throwing any exceptions.

I have a requirement to detect and reject corrupt PDFs (PDFs that will not render in, say, Acrobat).
Is there a method or approach I could take using Aspose.PDF to enumerate the contents of the PDF and detect corruption, either in a try/catch looking for exceptions or through some property on the object model that indicates there were validation failures?

Thanks in advance for any advice / pointers you can give me here.

@johnsmith111155

Thank you for contacting support.

You can check whether the source input is a valid PDF file by using the IsPdfFile property, as in the code sample below:

        PdfFileInfo info = new PdfFileInfo(dataDir + "Sample Response.txt");

        if (info.IsPdfFile)
        {
            Console.WriteLine("Valid PDF file");
        }
        else
        {
            Console.WriteLine("Invalid PDF file");
        }

I hope this will be helpful. If this does not satisfy your requirements, please share the corrupted PDF file with us so that we can investigate and help you out.

Thanks - it doesn’t really satisfy the requirements, but I don’t think that’s your fault.
I’m looking for a fairly foolproof method of detecting corrupt PDFs, but that’s a somewhat vague requirement.
For example if an image inside of a PDF is corrupt the PDF will render but may display an error in Acrobat when you flip to the page with the image.

What I ended up doing was a multi-pronged approach:
I used your recommendation above as my first check.
Then I tried to extract all text as my second check.
Then I tried converting the entire PDF to a TIFF as my third check.
I may have done some other miscellaneous things as well.
I don’t think there’s a single silver bullet here.
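
For anyone who wants to try the same layered approach, the three checks above could be sketched roughly as follows. This is only a sketch: it assumes the Aspose.PDF for .NET types PdfFileInfo, Document, TextAbsorber and TiffDevice behave as documented, and the class name PdfSanityCheck, the method name LooksRenderable and the 72 dpi resolution are illustrative choices that do not come from this thread. A file that passes all three checks may still show errors in Acrobat.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Facades;
using Aspose.Pdf.Text;

public static class PdfSanityCheck
{
    // Hypothetical helper: returns false as soon as any check fails or throws.
    public static bool LooksRenderable(string path)
    {
        try
        {
            // Check 1: the IsPdfFile test suggested earlier in the thread.
            using (PdfFileInfo info = new PdfFileInfo(path))
            {
                if (!info.IsPdfFile)
                    return false;
            }

            using (Document doc = new Document(path))
            {
                // Check 2: walk every page and extract its text.
                TextAbsorber absorber = new TextAbsorber();
                doc.Pages.Accept(absorber);

                // Check 3: render all pages into an in-memory TIFF.
                using (MemoryStream tiff = new MemoryStream())
                {
                    TiffDevice device = new TiffDevice(new Resolution(72));
                    device.Process(doc, tiff);
                    return tiff.Length > 0;
                }
            }
        }
        catch (Exception)
        {
            // Any exception during loading, extraction or rendering
            // is treated as a sign of corruption.
            return false;
        }
    }
}
```

Catching every exception is deliberately blunt here; in practice you would probably want to log the exception type so that out-of-memory or licensing errors are not mistaken for corrupt input.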

@johnsmith111155

Would you please share the source PDF files you are referring to when you mention “detecting corrupt PDFs”? Please also describe the problems you want to detect in those specific files so that we can investigate further and help you out.

Hi Farhan,
I used the above code and tested it with an invalid PDF, but it still allows me to upload the file.

Please verify.
Test1_Corrupted.pdf (89.6 KB)

@KavithaReddy10

Would you please explain what you mean by uploading? Are you trying to upload the PDF file to a server, or do you just want to determine whether it is valid?

Thank you for your response.
I am trying to determine whether it is valid or not.

We have one more issue. We are using the code below to check whether the PDF content has any format exception, but for some PDFs the tagged content check is timing out. Please let me know why, or whether you have any other solution for checking that the content of a PDF is valid.

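// Load the uploaded file and read the page dimensions from PageInfo.
Aspose.Pdf.Document pdfDoc = new Aspose.Pdf.Document(uploadedFile.InputStream);
double Width = pdfDoc.PageInfo.Width;
double Height = pdfDoc.PageInfo.Height;
// Treat the content as invalid if the tagged content text mentions a FormatException.
if (pdfDoc.TaggedContent.ToString().Contains("FormatException"))
    contentMatched = false;
else
    contentMatched = true;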
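
While the cause of the timeout is unknown, one way to keep a slow check from blocking an upload is to run it under a time limit and treat "did not finish" as a failed validation. The sketch below uses only the .NET Task API; the class name TimeBoxed and the method name CompletesWithin are hypothetical, and it does not explain or fix the underlying slowness.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeBoxed
{
    // Hypothetical helper: runs `check` on a worker thread and reports whether
    // it finished, without throwing, inside `limit`.
    public static bool CompletesWithin(Action check, TimeSpan limit)
    {
        Task task = Task.Factory.StartNew(check);

        // Observe any exception raised after a timeout so it never
        // surfaces later as an unobserved task exception.
        task.ContinueWith(
            t => { var ignored = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);

        try
        {
            // Wait returns false on timeout and throws AggregateException
            // if the check itself threw.
            return task.Wait(limit);
        }
        catch (AggregateException)
        {
            return false;
        }
    }
}
```

A call might look like `TimeBoxed.CompletesWithin(() => pdfDoc.TaggedContent.ToString(), TimeSpan.FromSeconds(30))`, where the 30-second limit is an arbitrary example. Note that the timed-out work is not cancelled; it keeps running on its worker thread until it finishes on its own.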

@KavithaReddy10

We tested the case using version 22.9 of the API and did not notice any issue. The API returned false for the property pdfFileInfo.IsPdfFile. Can you please make sure you are using version 22.9 and let us know if you still face any issues?

Can you please share a sample PDF for this case as well?

Hi Ali,
Can you please provide an email address? I will send the PDF via email.

Thanks,
Kavitha

@KavithaReddy10

You can share your file in a private message. Click the top left button in the post editor to convert your post into a private message, where you can attach your file. We will then proceed accordingly. image.png (8.8 KB)

22-1017-2022-DSNP-Connected-Access-Formulary.pdf (2.4 MB)

Please find the attachment for which I am facing the issue.

@KavithaReddy10

We are checking it and will get back to you shortly.

@KavithaReddy10

We were able to notice this issue in our environment. Therefore, it has been logged under the ticket ID PDFNET-52742 in our issue tracking system for further investigation. We will look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hi Ali,
Any update on this issue or any alternative solution?

If we re-create the problematic PDF, will that resolve the issue?

Thanks

@KavithaReddy10

The ticket was recently logged in our issue tracking system and is pending initial analysis. We will investigate and resolve it on a first-come, first-served basis. We are afraid that we cannot comment further without determining the actual cause of the issue. As soon as we make definite progress towards resolving the ticket, we will share updates with you via this forum thread. Please spare us some time.

We are sorry for the inconvenience.