Aspose OCR is slow while converting an unsearchable PDF to a searchable PDF

I am using Aspose OCR in my code to convert an unsearchable scanned PDF to a searchable PDF. Scanning and converting just 1 page takes 2 minutes. How can I optimize the execution speed?

@Krishalini_H

To optimize the speed of converting unsearchable scanned PDFs to searchable PDFs using Aspose OCR, consider the following strategies:

  1. Use Fast Mode: If your application allows, switch to a faster OCR mode. This can significantly reduce processing time, especially for simpler documents.

  2. Image Preprocessing: Before running OCR, preprocess the images to enhance quality. This can include resizing, converting to grayscale, or increasing contrast, which can help the OCR engine recognize text more efficiently.

  3. Batch Processing: If you have multiple pages, consider processing them in batches rather than one at a time. This can reduce overhead and improve overall performance.

  4. Optimize Image Resolution: Ensure that the images are not excessively high in resolution. A resolution of around 300 DPI is generally sufficient for OCR tasks.

  5. Use Multi-threading: If your environment supports it, implement multi-threading to process multiple pages simultaneously, which can significantly speed up the overall conversion time.

Here’s a simple code snippet to illustrate how you might implement some of these optimizations:

using Aspose.OCR;
using Aspose.OCR.ImageProcessing;

public void ConvertPdfToSearchablePdf(string inputPdfPath, string outputPdfPath)
{
    // Load the document
    using (var document = new Aspose.Pdf.Document(inputPdfPath))
    {
        // Create an OCR engine
        OcrEngine ocrEngine = new OcrEngine();

        foreach (var page in document.Pages)
        {
            // Preprocess the image (e.g., convert to grayscale)
            var image = page.ToImage();
            image = ImagePreprocessing.ConvertToGrayscale(image);

            // Set the OCR engine to fast mode
            ocrEngine.Settings.UseFastMode = true;

            // Perform OCR
            var result = ocrEngine.RecognizeImage(image);
            // Save or process the result as needed
        }

        // Save the searchable PDF
        document.Save(outputPdfPath);
    }
}

Applying these strategies should reduce your OCR processing time, although the gain depends on the document and the hardware. Please try these optimizations and let me know if you need further assistance!

using Aspose.OCR;
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApp3
{
    public class Program
    {
        static void Main(string[] args)
        {
            Aspose.OCR.License license = new Aspose.OCR.License();
            license.SetLicense("Aspose.Total.NET.lic");

            string ipPath = "Sample.pdf";
            string opPath = "SamplePdfOutput.pdf";

            OcrInput pdf = new OcrInput(InputType.PDF);
            pdf.Add(ipPath);
            RecognitionSettings settings = new RecognitionSettings()
            {
                Language = Language.Eng,
                ThreadsCount = 0,             // Use all cores
                //UpscaleSmallFont = true,   // Optional: improves small text
                DetectAreasMode = DetectAreasMode.LEAN,
                LinesFiltration = false,
                AutomaticColorInversion = false
            };
            AsposeOcr api = new AsposeOcr();
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            Console.WriteLine("OCR started");
            List<RecognitionResult> result = api.Recognize(pdf, settings);
            //List<RecognitionResult> result = new List<RecognitionResult>();
            Console.WriteLine("OCR done");
            AsposeOcr.SaveMultipageDocument(opPath, Aspose.OCR.SaveFormat.Pdf, result);
            Console.WriteLine("Conversion Done");
            stopwatch.Stop();
            Console.WriteLine($"Execution Time: {stopwatch.ElapsedMilliseconds} ms");
            Console.ReadLine();
        }
    }
}
This is the code I am currently working with. It takes 2 minutes to process a 1-page PDF.
As for the code you provided, OcrEngine() and page.ToImage() are no longer available in Aspose.OCR and Aspose.PDF. I am using Aspose.OCR version 25.7.0 and Aspose.PDF version 25.7.0, and these methods are not supported in these versions. Kindly provide a solution that is compatible with the latest versions of Aspose.OCR and Aspose.PDF.

@Krishalini_H

Would you kindly share your sample PDF document for our reference? We need it to investigate the performance issue that you are facing. We will register a ticket in our issue tracking system and share the ID with you.

I am uploading the sample 1-page PDF that I am testing with for your reference.
inputFile.pdf (501.0 KB)
Kindly raise the ticket and share the ID.

@Krishalini_H

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-1080

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

Hi, I have a paid support contract, with an order ID and a user ID. How should I raise a paid support ticket, and which credentials should I use to log in to the paid support portal? Kindly let me know ASAP.

@Krishalini_H

You can use the same email address that was used to purchase the paid support subscription. Once you have access, you can create a topic there that references the ticket ID shared in this forum thread. The issue will then be raised to the highest priority.

Hi,
Are there any updates on issue ID OCRNET-1080? May I know its current status? Kindly let me know.

@Krishalini_H

We have reviewed your file and code, and here are the processing times on our systems:

  • Up to 20 seconds on an Intel Core i7 (2.7GHz, 4 cores) with 32 GB RAM.
  • 6.4 seconds on an AMD Ryzen 5 3600 (6 cores @ 3.60 GHz) with 32 GB RAM.

Could you share details about the hardware you’re using to run the code? Have you tried executing it multiple times? Additionally, please confirm whether you’re using Release or Debug mode, your .NET version, and any other relevant runtime settings.
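The details requested above can be collected from inside the program itself. The following is a minimal sketch, not part of any Aspose API: the `EnvironmentReport` class and its method names are hypothetical, and it relies only on standard .NET types (`RuntimeInformation` requires .NET Framework 4.7.1 or later). It treats an assembly as a Debug build when the JIT optimizer is disabled, which is the default for Debug configurations.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Runtime.InteropServices;

// Hypothetical helper, shown for illustration only.
public static class EnvironmentReport
{
    // True when the assembly was compiled with JIT optimizations disabled,
    // which is what a default Debug configuration produces.
    public static bool IsDebugBuild(Assembly assembly)
    {
        var attribute = assembly.GetCustomAttribute<DebuggableAttribute>();
        return attribute != null && attribute.IsJITOptimizerDisabled;
    }

    // Builds a short, multi-line description of the runtime environment.
    public static string Describe(Assembly assembly)
    {
        return string.Join(Environment.NewLine, new[]
        {
            "OS: " + RuntimeInformation.OSDescription,
            "Runtime: " + RuntimeInformation.FrameworkDescription,
            "Logical processors: " + Environment.ProcessorCount,
            "64-bit process: " + Environment.Is64BitProcess,
            "Debug build: " + IsDebugBuild(assembly),
            "Debugger attached: " + Debugger.IsAttached
        });
    }
}

public static class Program
{
    public static void Main()
    {
        // Report on the assembly that contains the code being measured.
        Console.WriteLine(EnvironmentReport.Describe(Assembly.GetExecutingAssembly()));
    }
}
```

Pasting this output into the thread alongside the timing covers the operating system, runtime version, core count, and build mode in one step.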

Hi @asad.ali,
There is a big difference in execution time. On your systems it takes 20 seconds or less, but for me it takes about 2 minutes or more.
My hardware and runtime details:

  • Windows 11 Enterprise, 4 processors, 16 GB RAM.
  • I ran the code multiple times and got the same execution time each time.
  • I use only Debug mode, not Release mode.
  • The .NET version I use is Microsoft .NET Framework 4.8.09032.

@Krishalini_H

API performance should be measured in Release mode. Would you please run the program once in Release mode and let us know the results? We will proceed accordingly. Also, please make sure that you are using the latest available version of the API.
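As a rough illustration of a repeatable measurement, the sketch below times an arbitrary action several times after one untimed warm-up run, so that JIT compilation and one-time initialization are not counted. The `TimingHarness` class is hypothetical and uses only standard .NET types; the `Thread.Sleep` workload is a placeholder that would be replaced by the OCR call being measured.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, shown for illustration only.
public static class TimingHarness
{
    // Runs the action once untimed (warm-up), then `runs` timed iterations,
    // and returns the elapsed milliseconds of each timed run.
    public static List<long> Measure(Action action, int runs)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        action(); // warm-up: JIT compilation and one-time initialization

        var timings = new List<long>(runs);
        for (int i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            timings.Add(stopwatch.ElapsedMilliseconds);
        }
        return timings;
    }
}

public static class Program
{
    public static void Main()
    {
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: a debugger is attached; timings may be inflated.");

        // Placeholder workload; replace with the OCR call being measured.
        Action workload = () => Thread.Sleep(50);

        List<long> timings = TimingHarness.Measure(workload, 3);
        timings.Sort();
        Console.WriteLine("Fastest run: " + timings[0] + " ms");
        Console.WriteLine("Slowest run: " + timings[timings.Count - 1] + " ms");
    }
}
```

Building with the Release configuration and starting the executable without a debugger attached gives numbers that are comparable with the timings reported earlier in this thread.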

Hi,
I’m a developer on the Aspose.OCR team.
I have measured the recognition time on different hardware using .NET 8 (results attached). The test file is also attached.

Please run this file on your system. If possible, use .NET 6.0 or higher.

Thank you!
test.pdf (706.6 KB)

recognition_time.png (54.1 KB)