Free Support Forum - aspose.com

A PDF with scanned and searchable pages

We are using the RecognizePdf method to recognize text, followed by the AsposeOcr.SaveMultipageDocument method to save the result. This works if the PDF contains only scanned pages. However, if the PDF contains both scanned and searchable pages, the searchable pages are not recognized and are not saved.
A test pdf:
Testing OCR.pdf (61.5 KB)
A result pdf:
converted-Testing OCR.pdf (17.6 KB)

Sample code:
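using System.Collections.Generic;
using Aspose.OCR;

// The OCR engine instance; the original post used it without showing its creation
var asposeOcr = new AsposeOcr();

var documentRecognitionSettings = new DocumentRecognitionSettings
{
    StartPage = 0,
    PagesNumber = 2,
    DetectAreas = true,
    AutoDenoising = true,
    DetectAreasMode = DetectAreasMode.COMBINE
};

// Recognize images from PDF
List<RecognitionResult> res = asposeOcr.RecognizePdf("Testing OCR.pdf", documentRecognitionSettings);

// Save the recognition results as a multipage PDF
AsposeOcr.SaveMultipageDocument("converted-Testing OCR.pdf", Aspose.OCR.SaveFormat.Pdf, res);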

@ChrisWongASL

At the moment, Aspose.OCR can extract text only from images (scanned PDF pages). To get the searchable text from a PDF, you need to extract it with Aspose.PDF or another library. We are afraid that this is not supported in the OCR API.

Hi, @ChrisWongASL

Would you please explain what exactly you are trying to accomplish? For instance, do you want to parse PDF files with hybrid content (some pages with PDF text, some with scanned images, some with text layers (searchable PDF), or a mix of these) and get all the available text from these pages in page order?

Depending on the scenario, I think it would be necessary to use both Aspose.OCR and Aspose.PDF, as @asad.ali mentioned.
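
As a rough sketch of how the two sources could be combined, assuming you have already collected one string per page from each library: the text-layer strings might come from Aspose.PDF (for example, a TextAbsorber run per page) and the OCR strings from the RecognitionText of each RecognitionResult. The HybridPdfText class, the MergePageText method, and the minTextLayerChars parameter below are hypothetical names for illustration and are not part of either API. It only merges text and does not produce a searchable PDF.

```csharp
using System;
using System.Collections.Generic;

public static class HybridPdfText
{
    // Hypothetical helper, not part of Aspose.OCR or Aspose.PDF.
    // textLayerPages[i]: text taken from the text layer of page i (empty for scanned pages).
    // ocrPages[i]: text recognized by OCR for page i.
    public static List<string> MergePageText(
        IReadOnlyList<string> textLayerPages,
        IReadOnlyList<string> ocrPages,
        int minTextLayerChars = 1)
    {
        if (textLayerPages.Count != ocrPages.Count)
            throw new ArgumentException("Both lists must have one entry per page.");

        var merged = new List<string>(textLayerPages.Count);
        for (int i = 0; i < textLayerPages.Count; i++)
        {
            string layer = (textLayerPages[i] ?? string.Empty).Trim();
            string ocr = (ocrPages[i] ?? string.Empty).Trim();

            // Prefer the embedded text layer when the page has one;
            // fall back to the OCR result for scanned pages.
            merged.Add(layer.Length >= minTextLayerChars ? layer : ocr);
        }
        return merged;
    }
}
```

Raising minTextLayerChars is one way to treat pages whose text layer holds only a stray character or two as scanned pages, though the right threshold would depend on your documents.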

Thanks