Converting a non-OCR PDF with images and text to an OCR PDF

Nawaz_Ahmed · November 7, 2024, 12:26pm

Hi,

How can we convert non-OCR PDF documents to OCR PDF documents when the document contains both images and text?

Also, does Aspose.OCR support OCR of Chinese documents with images and Chinese text?

Professionalize.Discourse · November 7, 2024, 12:26pm

To convert non-OCR PDF documents that contain both images and text into OCR PDF documents, you can utilize Aspose.OCR. This library allows you to recognize text from PDF files, effectively creating a searchable and indexable document.

Here’s a general approach to achieve this:

1. Initialize the Aspose.OCR API: Create an instance of the OCR API.
2. Load the PDF document: Use the OcrInput class to add your PDF document for recognition.
3. Set recognition options: You can specify language settings to improve accuracy, especially if your document contains text in different languages.
4. Perform OCR: Call the recognition method to process the PDF and extract text.
5. Output the results: The recognized text can be saved or displayed as needed.

Here’s a sample code snippet in C# that demonstrates how to recognize text from a PDF:

using System;
using System.Collections.Generic;
using Aspose.OCR;

namespace Aspose.OCR.Examples
{
    public class RecognizePdf
    {
        public static void Run()
        {
            // Initialize an instance of AsposeOcr
            AsposeOcr api = new AsposeOcr();

            // Create OcrInput object and add the PDF document
            OcrInput input = new OcrInput(InputType.PDF);
            input.Add("path_to_your_pdf.pdf", startPage: 1, pagesCount: 1); // Adjust page range as needed

            // Perform OCR
            List<RecognitionResult> results = api.Recognize(input, new RecognitionSettings { });

            // Print the results
            foreach (var result in results)
            {
                Console.WriteLine($"Recognized Text: {result.RecognitionText}");
            }
        }
    }
}

Regarding your second question, yes, Aspose.OCR supports OCR for Chinese documents that contain images and Chinese text. You can specify the language during the recognition process to ensure accurate text extraction.
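The sample above only prints the recognized text. As a minimal sketch of the language setting and of saving a searchable PDF, something like the following may work. It assumes that `Language.Chi` is the Chinese value of the `Language` enum and that the static `AsposeOcr.SaveMultipageDocument` method with `SaveFormat.Pdf` is available in your installed version of Aspose.OCR for .NET; both should be checked against the API reference. The file paths are placeholders.

```csharp
using System;
using Aspose.OCR;

namespace SearchablePdfSketch
{
    public static class Program
    {
        public static void Main()
        {
            try
            {
                AsposeOcr api = new AsposeOcr();

                // Load the source PDF (placeholder path).
                OcrInput input = new OcrInput(InputType.PDF);
                input.Add("path_to_your_pdf.pdf");

                // Assumption: Language.Chi selects Chinese recognition.
                RecognitionSettings settings = new RecognitionSettings
                {
                    Language = Language.Chi
                };

                var results = api.Recognize(input, settings);

                // Assumption: this static method writes the recognized pages
                // to a single searchable PDF (placeholder output path).
                AsposeOcr.SaveMultipageDocument(
                    "searchable_output.pdf", SaveFormat.Pdf, results);
            }
            catch (Exception ex)
            {
                // Print the full exception, including any inner exception
                // and stack trace, to help diagnose recognition failures.
                Console.WriteLine(ex);
            }
        }
    }
}
```

Printing the whole exception object rather than only its message can make a generic "Exception of type 'System.Exception' was thrown" error easier to trace.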

For more detailed guidance, you can refer to the official documentation on creating searchable PDFs from scanned documents [1], which outlines the process and provides additional examples, and to the OCR operations guide for Java-specific operations [2].

If you have any further questions or need additional assistance, feel free to ask!

Sources:
[1]: Creating a searchable PDF from the scan - Aspose Documentation
[2]: OCR Operations | Aspose.OCR Java API

Nawaz_Ahmed · November 7, 2024, 12:52pm

I am unable to OCR a Chinese PDF document to PDF.

I am getting the error “Exception of type ‘System.Exception’ was thrown”.

asad.ali · November 8, 2024, 1:35pm

@Nawaz_Ahmed

Would you kindly share the sample file that you are trying to work with? We will test the scenario in our environment and address it accordingly.