Aspose.OCR for .NET is very slow when used to OCR PDF files

Hi Team,

I am working on replacing a third-party tool that an application uses to OCR images and PDF files containing images. I used Aspose.OCR and Aspose.PDF in my proof-of-concept replacement. It produces the required output, but performance is a problem.

For just 162 PDF files it takes 1 hr 15 min.

Please find the code snippet:
using System;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

public void ExtractImagesAndPerformOCR(string pdfPath)
{
    // Initialize the Aspose.PDF license (AsposePdfLicense holds the license XML)
    Aspose.Pdf.License license = new Aspose.Pdf.License();
    using (MemoryStream stream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(AsposePdfLicense)))
    {
        license.SetLicense(stream);
    }

    // Load the PDF document
    Document pdfDocument = new Document(pdfPath);
    int pageCount = pdfDocument.Pages.Count;
    Console.WriteLine($"Pages: {pageCount}");

    // Initialize the Aspose.OCR engine once and reuse it for every page
    AsposeOcr ocrEngine = new AsposeOcr();

    // Loop through each page in the PDF (page indexes are 1-based)
    for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
    {
        Page page = pdfDocument.Pages[pageIndex];

        // Create a stream to hold the rendered page image
        using (MemoryStream imageStream = new MemoryStream())
        {
            // Render the page to a JPEG at 150 DPI, quality 75
            Resolution resolution = new Resolution(150);
            JpegDevice jpegDevice = new JpegDevice(resolution, 75);
            jpegDevice.Process(page, imageStream);

            // Perform OCR on the rendered image
            imageStream.Position = 0;
            OcrInput input = new OcrInput(InputType.SingleImage, null);
            input.Add(imageStream);
            OcrOutput result = ocrEngine.Recognize(input, new RecognitionSettings());

            // Output the recognized text
            if (result != null && result.Count > 0)
            {
                Console.WriteLine($"Page {pageIndex} OCR Result:");
                Console.WriteLine(result[0].RecognitionText);
            }
        }
    }
}

Please suggest a way to improve the OCR performance.

Regards,
Ramya.B

@Ramya_Bangalore

To improve the performance of OCR when processing PDF files with Aspose.OCR, consider the following suggestions:

  1. Use Direct PDF Input: Instead of converting each page of the PDF to an image and then performing OCR, you can directly pass the PDF file to the OCR engine. This can significantly reduce processing time. For example, you can create an OcrInput object and add the PDF file directly, specifying the page range you want to process.

    Here’s a code snippet demonstrating this approach:

    public void PerformOCROnPdf(string pdfPath)
    {
        // Initialize Aspose.OCR engine
        AsposeOcr ocrEngine = new AsposeOcr();
    
        // Set recognition options
        RecognitionSettings settings = new RecognitionSettings();
        settings.Language = Language.Eng; // Set the language as needed
    
        // Create OcrInput object and add the PDF file
        OcrInput input = new OcrInput(InputType.PDF);
        input.Add(pdfPath, 0, 162); // Adjust the page range as necessary
    
        // Perform OCR
        var result = ocrEngine.Recognize(input, settings);
    
        // Output the recognized text
        foreach (var recognitionResult in result)
        {
            Console.WriteLine($"Page OCR Result:");
            Console.WriteLine(recognitionResult.RecognitionText);
        }
    }
    
  2. Adjust Resolution and Quality: If you still need to convert pages to images, consider adjusting the resolution and quality settings. A lower resolution may speed up processing, but ensure it does not compromise the accuracy of the OCR results.

  3. Batch Processing: If you have multiple PDF files, consider processing them in batches or using parallel processing to utilize multiple threads, which can help reduce overall processing time.

  4. Optimize Recognition Settings: Review and optimize the recognition settings based on your specific use case. For instance, if you are only interested in certain languages or specific types of content, adjusting these settings can improve performance.
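The batch-processing idea in item 3 could be sketched roughly as follows. This is only an illustration that uses standard .NET: `PdfBatchRunner`, `ProcessFolder`, `processPdf`, and `maxParallel` are hypothetical names, and `processPdf` stands in for whatever per-file OCR routine you already have. Whether the OCR engine already uses several cores internally, or whether one engine instance can safely be shared across threads, is not confirmed here, so the gain should be measured rather than assumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class PdfBatchRunner
{
    // Runs processPdf once per PDF file in the folder, with at most
    // maxParallel files being processed at the same time.
    public static void ProcessFolder(string folder, Action<string> processPdf, int maxParallel)
    {
        IEnumerable<string> pdfFiles = Directory.EnumerateFiles(folder, "*.pdf");
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(pdfFiles, options, pdfPath =>
        {
            try
            {
                processPdf(pdfPath);
            }
            catch (Exception ex)
            {
                // Report the failure and continue, so one bad file does not stop the batch.
                Console.Error.WriteLine($"{pdfPath}: {ex.Message}");
            }
        });
    }
}
```

A call might look like `PdfBatchRunner.ProcessFolder(@"C:\input", PerformOCROnPdf, 4);`. Starting with a low `maxParallel` and timing each run is a reasonable way to see whether parallelism helps at all on a given machine.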

By implementing these strategies, you should see a noticeable improvement in the performance of your OCR tasks with Aspose.OCR.

If you have further questions or need additional assistance, feel free to ask!

Thanks for the quick reply.

However, when I use the code you provided, I get the following exception:
System.ArgumentException
HResult=0x80070057
Message=Extended size less than origin size
Source=Aspose.OCR
StackTrace:
at .(Byte[,][] , Int32 , Int32 , Boolean )
at .(Byte[,][] , Double , Int32 )
at .(InferenceSession , List`1 , Language , String , String , Boolean , Double )
at .(InferenceSession , List`1 , Language , String , String , Boolean , Double )
at .(List`1 , List`1 , List`1 )
at .(List`1 )
at .(List`1 , RecognitionSettings ,  , InferenceSession )
at .(List`1 , RecognitionSettings ,  )
at .(RecognitionSettings ,  )
at .(RecognitionSettings ,  )
at .(Byte[,][] , String , RecognitionSettings ,  )
at .(ImageData , RecognitionSettings ,  , Action`4 )
at .(ImageData , RecognitionSettings ,  , Action`4 )
at .(OcrInput , RecognitionSettings ,  , Action`4 , CancellationToken )
at .(OcrInput , RecognitionSettings ,  , Action`4 )
at Aspose.OCR.AsposeOcr.Recognize(OcrInput images, RecognitionSettings settings)
at AsposeOCRLib.AsposeOCRApi.PerformOCROnPdf(String pdfPath) in C:\Users\qatest\Desktop\AsposeOCRLib\AsposeOCRLib\AsposeOCRLib\AsposeOCRLib\AsposeOCRApi.cs:line 200
at AsposeOCRConsole.Program.Main() in C:\Users\qatest\Desktop\AsposeOCRLib\AsposeOCRLib\AsposeOCRLib\AsposeOCRConsole\Program.cs:line 21

I also tried taking the page count as below:
Document pdfDocument = new Document(pdfPath);
int PageCount = pdfDocument.Pages.Count();

PDF File:
10-K Updates - (Deficit) Equity (1).pdf (644.2 KB)

I have attached the PDF above. Could you please let me know what I might be doing wrong?

Regards,
Ramya.B

@Ramya_Bangalore

The PDF document you shared already contains searchable text. Can you please explain why you need to perform OCR on it? You can simply extract the text using the Aspose.PDF API. Why are you converting pages to images and then performing OCR on them?
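For reference, extracting the existing text layer could look roughly like the sketch below. It assumes the Aspose.PDF `TextAbsorber` class from the `Aspose.Pdf.Text` namespace. `PdfTextExtractor` and `ExtractText` are illustrative names only. Text that exists only inside images will not be returned, because no OCR is involved.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class PdfTextExtractor
{
    // Returns the searchable text of every page in the document.
    // No OCR is performed, so text embedded in images is not included.
    public static string ExtractText(string pdfPath)
    {
        using (Document pdfDocument = new Document(pdfPath))
        {
            TextAbsorber absorber = new TextAbsorber();

            // Visit all pages and collect their text into the absorber.
            pdfDocument.Pages.Accept(absorber);

            return absorber.Text;
        }
    }
}
```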

Hi Asad,

As you can see, the PDF also contains images, which we must OCR.

I have also already tried the code below, which passes the PDF directly:
public void PerformOCROnPdf(string pdfPath)
{
    // Initialize Aspose.OCR engine
    AsposeOcr ocrEngine = new AsposeOcr();

    // Set recognition options
    RecognitionSettings settings = new RecognitionSettings();
    settings.Language = Language.Eng; // Set the language as needed

    // Create OcrInput object and add the PDF file
    OcrInput input = new OcrInput(InputType.PDF);
    input.Add(pdfPath, 0, 162); // Adjust the page range as necessary

    // Perform OCR
    var result = ocrEngine.Recognize(input, settings);

    // Output the recognized text
    foreach (var recognitionResult in result)
    {
        Console.WriteLine("Page OCR Result:");
        Console.WriteLine(recognitionResult.RecognitionText);
    }
}

With this function, passing the same PDF throws an exception at the Recognize call (see my first reply):
System.ArgumentException
HResult=0x80070057
Message=Extended size less than origin size
Source=Aspose.OCR
Hence, I tried converting the pages to images and performing OCR on them, which is far too slow. Performance matters a lot in our application: the earlier third-party software could OCR 162 pages in 10 minutes, whereas Aspose.OCR takes around 1 hr 50 min.

Please suggest how I can resolve this performance issue.

A small question: if the above code can be made to work, will it also OCR the images in the PDF file?

I would also like to add that we tried multithreading with .NET's Parallel.ForEach, and it did not improve performance. I also tried using RecognizeFast without any input settings, which did not yield much of a performance improvement either.
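One hybrid approach that might be worth testing, given that the PDF already has a text layer, is to OCR only the images embedded in each page instead of rendering whole pages. The sketch below is untested against the attached file. It assumes that `page.Resources.Images` exposes the embedded `XImage` objects and that `XImage.Save(Stream)` writes them in a format the OCR engine accepts. `EmbeddedImageOcr` and `RecognizeEmbeddedImages` are illustrative names.

```csharp
using System;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;

public static class EmbeddedImageOcr
{
    // Runs OCR on each image embedded in the PDF, page by page,
    // without rendering the full pages to bitmaps first.
    public static void RecognizeEmbeddedImages(string pdfPath)
    {
        AsposeOcr ocrEngine = new AsposeOcr();
        RecognitionSettings settings = new RecognitionSettings();

        using (Document pdfDocument = new Document(pdfPath))
        {
            // Aspose.PDF page indexes are 1-based.
            for (int pageIndex = 1; pageIndex <= pdfDocument.Pages.Count; pageIndex++)
            {
                Page page = pdfDocument.Pages[pageIndex];

                foreach (XImage image in page.Resources.Images)
                {
                    using (MemoryStream imageStream = new MemoryStream())
                    {
                        // Copy the embedded image into memory and rewind the stream.
                        image.Save(imageStream);
                        imageStream.Position = 0;

                        OcrInput input = new OcrInput(InputType.SingleImage);
                        input.Add(imageStream);
                        OcrOutput result = ocrEngine.Recognize(input, settings);

                        if (result != null && result.Count > 0)
                        {
                            Console.WriteLine($"Page {pageIndex} image OCR result:");
                            Console.WriteLine(result[0].RecognitionText);
                        }
                    }
                }
            }
        }
    }
}
```

Whether this is actually faster depends on how many images each page contains and how large they are, so it would need to be timed against the page-rendering version.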

Regards,
Ramya.B

@Ramya_Bangalore

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-1018

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.