Aspose OCR is not extracting text from pdf which is not searchable

Aspose OCR is not extracting all of the text from a non-searchable PDF. For both multipage and single-page PDFs, only the top few lines are extracted and the rest of the page remains an image, even though I have an Aspose.Total license. Please find the Java code snippet below; I am using aspose-ocr-25.10.0.jar and aspose-pdf-25.8.jar.

import com.aspose.ocr.*;
import com.aspose.ocr.models.Format;
import com.aspose.pdf.License;
import java.io.IOException;
import java.util.ArrayList;

public class OCRRecognizePdf {

    public static void main(String[] args) throws IOException, AsposeOCRException {
        // ExStart:1
        // The path to the documents directory.
        String dataDir = "D:/Aspose/";

        // The image path
        String file = dataDir + "siglescan.pdf";
        // Output file path
        String outputFile = dataDir + "output.pdf";
        try {
            // Instantiate a new License object
            License license = new License();
            // Set the license from a file
            license.setLicense(dataDir + "Aspose.Total.Product.Family 3.lic");
            System.out.println("Aspose.PDF license set successfully.");
        } catch (Exception e) {
            System.out.println("Could not set the Aspose.PDF license: " + e.getMessage());
        }

        // Create api instance
        AsposeOCR api = new AsposeOCR();

        // Set preprocessing filters to rotate image before recognition.
        PreprocessingFilter filters = new PreprocessingFilter();
        filters.add(PreprocessingFilter.AutoSkew());

        // Create OcrInput object and add images/documents for recognition
        OcrInput input = new OcrInput(InputType.PDF);
        input.add(file);

        ArrayList<RecognitionResult> result = api.Recognize(input);
        AsposeOCR.SaveMultipageDocument(outputFile, Format.Pdf, result);

        System.out.println("Full-page OCR completed. Output saved to: " + outputFile);
        System.out.println("Result: " + result.get(0).recognitionText);
    }

}

@Sudha1

It seems you’re experiencing issues with Aspose OCR not extracting text from non-searchable PDFs, particularly when only a few lines are being recognized. This can occur if the OCR engine is struggling with the quality of the images or the layout of the PDF.

Here are a few suggestions to improve text extraction:

  1. Check Image Quality: Ensure that the images in the PDF are of high quality. Low-resolution images can significantly hinder OCR performance.

  2. Use Different OCR Settings: Adjust the OCR settings to improve recognition. For instance, you can configure the OcrEngine to use different languages or modify the recognition mode.

  3. Process Each Page Individually: If you’re working with a multi-page PDF, consider extracting and processing each page separately. This approach can sometimes yield better results.

  4. Update Libraries: Ensure you are using the latest versions of the Aspose OCR and Aspose PDF libraries, as updates often include performance improvements and bug fixes.

Here’s a basic code snippet to help you get started with OCR on a PDF:

import com.aspose.pdf.Document;
import com.aspose.pdf.Image;
import com.aspose.ocr.OcrEngine;

public class OcrExample {
    public static void main(String[] args) {
        // Load the PDF document
        Document pdfDocument = new Document("input.pdf");
        
        // Create an instance of OcrEngine
        OcrEngine ocrEngine = new OcrEngine();
        
        // Iterate through each page of the PDF
        for (int i = 0; i < pdfDocument.getPages().size(); i++) {
            // Convert the page to an image
            Image pageImage = pdfDocument.getPages().get(i + 1).toImage();
            
            // Set the image to the OCR engine
            ocrEngine.setImage(pageImage);
            
            // Perform OCR
            if (ocrEngine.process()) {
                System.out.println("Extracted Text: " + ocrEngine.getText());
            } else {
                System.out.println("OCR failed on page " + (i + 1));
            }
        }
    }
}

Make sure to replace "input.pdf" with the path to your PDF file. This code processes each page of the PDF individually, which may help with the extraction issue.

If the problem persists, consider reaching out to Aspose support for more tailored assistance.

However, the import com.aspose.ocr.OcrEngine cannot be resolved. In which version is com.aspose.ocr.OcrEngine available? I am using the latest jar and the class is not there. Please suggest which version of Aspose OCR should be used.

Hi, could you please provide an update here? Also, even with the license, I am getting the Trial License message below after only a few lines are recognized. Could you please share updated code that applies the license?

Aspose.PDF license set successfully.
Full-page OCR completed. Output saved .

Result: Chapter T Global Menta Health
424
I the Sustainable Development Goals
Mental Healt!
TABLE 10-1
ainable Development Goals
Ensure healthy lives and well-being forall a all ages.
SDG3
equets hat countieBy 230 reduceby one third prmaturemolty from
SDG Target 3.4
coeunicable diseases hrough prevention a
*********** Trial Licenses ***********

@Sudha1

We apologize for the confusion caused by the automated response from the bot. The code snippet you shared in the first post is correct and should work. If it does not generate the expected results, please share your sample PDF document for our reference. We will test the scenario in our environment and address it accordingly.

Hi Asad, thanks for the reply.
The code I sent works after upgrading Java to the latest version.
However, is there a limit on the number of pages that can be processed? I first tried an image-only PDF of 10 pages and it worked fine. When I tried a new file of about 250 pages, I got: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. Is there a solution for this, and how can it be handled? We need to process PDF files of more than 300 pages every time, and we want to run this from a Linux server.
Thanks in advance.

@Sudha1

It is expected that the API requires more memory to process larger files. You need to increase the Java heap size when processing such files. If the issue persists, please share a sample PDF file that causes the error along with the complete error description and stack trace. We will log a ticket in our issue tracking system and investigate whether a workaround or fix can be implemented for such scenarios.
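
As a rough illustration of the heap size suggestion above: the maximum heap is normally set with the standard JVM option -Xmx when launching the program, for example java -Xmx4g -cp ".:aspose-ocr-25.10.0.jar:aspose-pdf-25.8.jar" OCRRecognizePdf on Linux. The 4g value is only an assumed starting point, not a figure confirmed for any particular document, and the right value depends on the PDF. The small class below is a sketch (HeapCheck and its 4096 MiB threshold are hypothetical names and values, not part of the Aspose API) that prints the heap limit the JVM actually picked up, which can help confirm that the option took effect on the server before starting a long OCR run.

```java
public class HeapCheck {

    // Returns the maximum amount of heap the JVM will attempt to use, in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Returns true when the configured maximum heap is at least requiredMiB.
    public static boolean hasAtLeast(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        // Assumed threshold for illustration only; tune it for your documents.
        long requiredMiB = 4096;

        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (!hasAtLeast(requiredMiB)) {
            System.out.println("Heap is below " + requiredMiB
                    + " MiB; restart the JVM with a larger -Xmx value.");
        }
    }
}
```

Running it with the same -Xmx value as the OCR job (for example java -Xmx4g HeapCheck) shows whether the setting is being applied in that environment.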