PDF Scanned Images to PDF Searchable

Is there a current example of converting a scanned-image PDF to a searchable PDF? We do not want to use Google’s Tesseract application. I found an example with Aspose.OCR for .NET Standard, but we need something for .NET Framework.

Scanned PDF to Searchable PDF with OCR in C# | Recognize Text PDF (aspose.com)

@mrs99mrs99

You do not necessarily need the .NET Standard build of Aspose.OCR for .NET to perform OCR on a scanned PDF. You can use the same code snippet in .NET Framework as well.

Which version of Aspose.OCR are you using to make Aspose.OCR for .NET|Documentation work? I’m using 19.9.3 and the code does not compile.

The code declares `string fullPath = dataDir + "OCR.pdf";`, but **fullPath** isn't used.

It also uses **imgPath** in the following line, but **imgPath** isn't declared anywhere:

List<RecognitionResult> result = api.RecognizePdf(**imgPath**, set);

@mrs99mrs99

The feature was added in version 21.8 of the API. As always, we recommend using the latest version of the API in order to use this feature.

The latest version of Aspose.OCR does not support .NET Framework. It only supports .NET Standard. Please see attached.

image.png (14.7 KB)

@mrs99mrs99

You are seeing this dependency because we removed the System.Drawing dependency and used the .NET Standard build of Aspose.Drawing instead; that was the main change. This dependency also works with .NET Framework, as we have already tested the latest version of the API with .NET Framework. Please feel free to install it using NuGet Package Manager and let us know if you face any errors.

I get an error saying that I can’t install Aspose.OCR because of a version conflict.

image.png (6.4 KB)

I created a .NET Standard client and am looking at the code. I don’t think this code will allow me to create a searchable PDF from a PDF with images.

Can this be done? How?

I’ve tried the .NET Standard Aspose.OCR example code listed here: Scanned PDF to Searchable PDF with OCR in C# | Recognize Text PDF (aspose.com).

That’s not making the PDF searchable. It’s just returning the text from the PDF in a List<RecognitionResult> collection.

Do you have code that does this?

[Make PDF searchable (aspose.app)](https://products.aspose.app/pdf/make-pdf-searchable)

@mrs99mrs99

Please use the below sample code to obtain a searchable PDF as an output:

// C# code. "OCR" is presumably a namespace alias, e.g. using OCR = Aspose.OCR;
var api = new OCR.AsposeOcr();

var settings = new OCR.DocumentRecognitionSettings();
settings.StartPage = 1;
settings.PagesNumber = 1;
settings.DetectAreas = true;
settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
settings.ThreadsCount = 1;

var res = api.RecognizePdf(dataDir + "multi_page_1.pdf", settings);

OCR.AsposeOcr.SaveMultipageDocument(dataDir + "test2.pdf", OCR.SaveFormat.Pdf, res);

Also, you need .NET Framework 4.6.1 or later to use the API under .NET Framework.
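For anyone comparing these settings with other snippets, here is a minimal sketch of which pages a `StartPage`/`PagesNumber` pair would cover. It assumes `StartPage` is a zero-based index and `PagesNumber` is a page count; this thread does not confirm that, although a later post sets `StartPage = 0`. `CoveredPages` is a hypothetical helper written for illustration and is not part of Aspose.OCR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageRangeSketch
{
    // Hypothetical helper (not an Aspose.OCR API): lists the zero-based page
    // indexes that would be processed, ASSUMING StartPage is zero-based and
    // PagesNumber is the number of pages to process from that index.
    public static List<int> CoveredPages(int startPage, int pagesNumber, int totalPages)
    {
        // Clamp the count so the range never runs past the end of the document.
        int count = Math.Max(0, Math.Min(pagesNumber, totalPages - startPage));
        return Enumerable.Range(startPage, count).ToList();
    }

    public static void Main()
    {
        // Settings from the snippet above, applied to a six-page document.
        Console.WriteLine(string.Join(", ", CoveredPages(1, 1, 6))); // prints "1"

        // Starting at index 0 and asking for six pages covers the whole document.
        Console.WriteLine(string.Join(", ", CoveredPages(0, 6, 6))); // prints "0, 1, 2, 3, 4, 5"
    }
}
```

Under that assumption, `StartPage = 1` with `PagesNumber = 1` would process only the second page and skip the first page entirely.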

I have used the code you specified and it is not recognizing all of the text. I have attached the sample pdf I’m using. Please test with it.

Nothing is recognized on the first page. The second page already has a couple of lines of searchable text, but after conversion the original searchable text is missing and only some of the remaining text on the page has been made searchable.

Can you advise on how to get this to work?

File1.PDF (443.6 KB)

@mrs99mrs99

Please check the code snippet below, which we used for testing, and the attached output PDF that was generated at our end:

var api = new OCR.AsposeOcr();

var settings = new OCR.DocumentRecognitionSettings();
settings.StartPage = 0;
settings.PagesNumber = 6;
//settings.LinesFiltration = true;
settings.DetectAreas = true;
settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
settings.ThreadsCount = 1;

var res = api.RecognizePdf(dataDir + "File1.pdf", settings);

OCR.AsposeOcr.SaveMultipageDocument(dataDir + "File1_OCRd.pdf", OCR.SaveFormat.Pdf, res);

File1_OCRd.pdf (6.4 MB)

To check the OCR results, please select the text on all pages of the attached file and paste it into a Notepad file. Please let us know if you notice any issues.

This is the same code given to us 3 days ago, which does not work. What is the difference? I looked at the file (File1_OCRd.pdf) that you sent back and it’s not recognizing all of the text. The first word, “Document”, is only partially recognized. There is searchable text on the original form that is missing from the file you sent back. First name, last name, social security number, etc. are all missing from the PDF you sent back.

Please advise if Aspose can create a valid searchable PDF.

This web page from Aspose is working. Do you have the code for this?

https://products.aspose.app/pdf/make-pdf-searchable

@mrs99mrs99

We are checking the scenario and will get back to you shortly.

@mrs99mrs99

In the web application (https://products.aspose.app/pdf/make-pdf-searchable), we use the Aspose.OCR Cloud API, which applies more complex processing when parsing PDF files. Unfortunately, we cannot do this in the downloadable version. At the moment, the downloadable version can only extract text from images (scanned PDFs). Text that is already searchable in the PDF must be extracted using Aspose.PDF or other tools. We apologize for the inconvenience.
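A minimal sketch of the Aspose.PDF side of that suggestion, assuming the Aspose.PDF for .NET package is installed and using `File1.pdf` as a placeholder path. It only reads out the text that is already searchable, so that it can be compared with or kept alongside the OCR result; it does not merge the two into a single searchable PDF.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ExistingTextSketch
{
    public static void Main()
    {
        // Placeholder path: point this at the PDF that already has some searchable text.
        using (var pdfDocument = new Document("File1.pdf"))
        {
            // TextAbsorber collects the existing text layer; it performs no OCR,
            // so text that exists only as a scanned image will not appear here.
            var absorber = new TextAbsorber();
            pdfDocument.Pages.Accept(absorber);

            Console.WriteLine(absorber.Text);
        }
    }
}
```

Pages for which this returns no text are the ones that rely entirely on OCR.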