Detecting Source/Type of PDF

Hello.
Is it possible to recognize whether a PDF is a text-based PDF (created from Word, for example) or a scanned, image-based document (perhaps even with OCR-recognized text)?

Best Regards

@BSchwabVal

Yes, it is possible to determine whether a PDF is text-based or a scanned document using Aspose tools.

For text-based PDFs, which are typically created from applications like Word, the text is stored as selectable and searchable content. In contrast, scanned documents are essentially images of text, and they may require Optical Character Recognition (OCR) to extract any text.

To recognize the type of PDF, you can utilize Aspose.OCR for Java. This library allows you to perform OCR on PDF documents, which can help identify if the content is image-based. Here’s a brief overview of how you can implement this:

  1. Set Up Your Project: Ensure that you have the Aspose.OCR library included in your Java project.

  2. Specify the PDF Document Path: Define the path to the PDF document you want to analyze.

  3. Create an API Instance: Instantiate the Aspose.OCRPdf class to create an instance for processing.

  4. Set Recognition Options: Configure any necessary recognition options, such as language settings.

  5. Perform OCR Recognition: Execute the OCR process on the PDF document. If the document is a scanned image, the OCR will extract the text, indicating that it was not originally text-based.

  6. Analyze Results: Review the recognition results to determine the nature of the PDF content.

By following these steps, you can effectively identify whether a PDF is text-based or a scanned document requiring OCR for text extraction.

For more detailed guidance, refer to the tutorial on recognizing PDF documents using Aspose.OCR for Java [1].

Sources:
[1]: OCR Recognizing PDF Documents in Aspose.OCR for Java

Thanks for the instructions, but unfortunately we use .net/C# and not Java.

@BSchwabVal

Please check the documentation article below to achieve your requirements, and let us know if you face any issues:

Hello.
Thanks for the link, but I don’t think this will work.

A scanned document with OCR will have both text and images. A text-based document created from Word, for example, can also have both text and images, so the mere presence of text or images does not distinguish the two.
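
One possible heuristic for telling the two cases apart (an assumption, not an Aspose feature): OCR software typically places the recognized text over the page image in text rendering mode 3 (`3 Tr`, invisible text), whereas text exported from Word is drawn visibly. The rough sketch below is in Python for illustration only. The function name `classify_page` and its labels are hypothetical, and it expects an already decoded page content stream, which would have to be obtained through whatever PDF library is in use.

```python
import re

# One pass over the operators of interest, in stream order:
#   "<n> Tr"       sets the text rendering mode (3 = invisible)
#   "(...) Tj/TJ"  shows text
#   "/Name Do"     paints an XObject (often the scanned page image)
_OPS = re.compile(
    r"(?P<mode>\d)\s+Tr\b"
    r"|(?P<show>[)\]>]\s*(?:Tj|TJ)\b)"
    r"|(?P<xobj>/\S+\s+Do\b)"
)


def classify_page(content: str) -> str:
    """Roughly classify one decoded page content stream (hypothetical helper)."""
    mode = 0  # PDF default: filled, i.e. visible, text
    visible = invisible = xobjects = 0
    for m in _OPS.finditer(content):
        if m.group("mode") is not None:
            mode = int(m.group("mode"))
        elif m.group("show") is not None:
            if mode == 3:
                invisible += 1
            else:
                visible += 1
        else:
            xobjects += 1
    if visible == 0 and invisible == 0:
        return "image-only" if xobjects else "empty"
    if visible == 0 and xobjects:
        return "scanned-with-ocr"  # only invisible text on top of an XObject
    return "text-based"


# Made-up content stream: a page-sized image followed by invisible text.
sample = "q 612 0 0 792 0 0 cm /Im0 Do Q BT 3 Tr /F1 10 Tf 72 700 Td (Invoice) Tj ET"
print(classify_page(sample))  # prints "scanned-with-ocr"
```

This is only a sketch: it ignores `q`/`Q` graphics-state save and restore, does not check whether the painted XObject is really an image covering the page, and can be fooled by operator-like text inside string literals, so its result is a hint and not a reliable detection.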

@BSchwabVal

This needs to be investigated. Can you please share a sample document as well? We will log an investigation ticket and share the ID with you.

The attached document has a text layer, but it is image-based and originates from a scanned document.

It should be recognized that this is not a text-based document created from Word.

dmg_aspose.pdf (1,6 MB)

@BSchwabVal

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-57985

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.