How can I make a PDF searchable? Recognize a scanned document into a searchable PDF using OCR in C#

hxb1851 · November 17, 2022, 2:33am

I have used the code below, but after output.pdf is saved, it is still not a searchable PDF file.

// Path to the scanned multipage PDF
string fullPath = "multi_page.pdf";

// Initialize AsposeOcr class object
AsposeOcr api = new AsposeOcr();

// Recognize images from PDF (first page only)
List<RecognitionResult> res = api.RecognizePdf(fullPath, new DocumentRecognitionSettings
{
    StartPage = 0,
    PagesNumber = 1
});

// Save result as searchable PDF
AsposeOcr.SaveMultipageDocument("output.pdf", SaveFormat.Pdf, res);

The demo on the website linked below does something similar to what I would like:

Please give me any advice. Thank you.

asad.ali · November 17, 2022, 5:30am

@hxb1851

Can you please share your sample PDF with us as well? We will test the scenario in our environment and address it accordingly.

hxb1851 · November 17, 2022, 9:20am

69159903.pdf (181.4 KB)
Attached is my image PDF. I would like C# code that produces a searchable PDF like the one from the web page linked below.

Thank you

asad.ali · November 17, 2022, 7:31pm

@hxb1851

We tested the scenario in our environment using the code snippet below and obtained the attached PDF file, in which a text layer was added by the API. Can you please make sure to use a valid license or a 30-day free temporary license, and let us know if you still notice any issues?

var api = new OCR.AsposeOcr();

var settings = new OCR.DocumentRecognitionSettings();
settings.StartPage = 0;
settings.PagesNumber = 1;
//settings.LinesFiltration = true;
settings.DetectAreas = true;
settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
settings.ThreadsCount = 1;
settings.Language = OCR.Language.Eng;

var res = api.RecognizePdf(dataDir + "69159903.pdf", settings);
foreach (var result in res)
{
    Console.WriteLine(result.RecognitionText);
}
OCR.AsposeOcr.SaveMultipageDocument(dataDir + "69159903_OCRd.pdf", OCR.SaveFormat.Pdf, res);

69159903_OCRd.pdf (981.8 KB)

hxb1851 · November 17, 2022, 8:48pm

@asad.ali
Hi there,
Thank you for your code. I tried the code above and got the error below:
Microsoft.ML.OnnxRuntime.OnnxRuntimeException: '[ErrorCode:RuntimeException] Non-zero status code returned while running ConvInteger node. Name:'Conv_656_quant' Status Message: bad allocation'
Do you know why this happens and how to resolve it?
Thank you again.

asad.ali · November 18, 2022, 5:23am

@hxb1851

It looks like the OnnxRuntime is not correctly installed in your project. Please build and debug your project in x64 mode, and target a .NET Framework version greater than 4.6.1. Re-install the API and its other dependencies using NuGet and let us know if the issue still occurs. If it persists, it would be best to create a separate console application from scratch and share it with us.

hxb1851 · November 18, 2022, 12:11pm

@asad.ali
Hi Asad.Ali,
I created a standalone project, used your code, and installed the NuGet package on .NET Framework 4.7. In testing, the output file is still the same: not a searchable PDF. It looks like it does not work at all.
I did not use a license. Does running your code without a license affect the result, or am I missing something? Please advise.
Thank you.

hxb1851 · November 18, 2022, 3:33pm

@asad.ali
One more thing: when I debugged your code, I observed that the recognition text did not include all of the text in the image PDF. Could the missing license be affecting this? See the attached screenshot, which shows that not all text in the image PDF was recognized. Please advise.
Thank you
Screenshot 2022-11-18 085457.png (8.4 KB)

asad.ali · November 18, 2022, 8:07pm

@hxb1851

Please find attached the results from our environment, both with and without a license: resutls.zip (2.0 KB)

hxb1851 · November 18, 2022, 8:38pm

@asad.ali
Hi asad.ali,
I looked at both results. Without a license, not all of the text in the image PDF was extracted. With a license, more text was extracted, but the result is still not correct because some text is still missing. For example, the image has SSN text in two locations, but only one was recognized, and the dates in the image are missing from the result. So far it does not work as expected. Please advise on any improvement, so that I can demo it to my team before we buy a license.
Thank you

asad.ali · November 19, 2022, 3:15pm

@hxb1851

We have logged an investigation ticket as OCRNET-614 in our issue tracking system to further analyze this case. We will definitely look into its details and see how we can improve the recognition quality. We will let you know as soon as the ticket is resolved. Please be patient and spare us a little time.

We are sorry for the inconvenience.

hxb1851 · November 22, 2022, 1:05pm

@asad.ali
Hi Asad.Ali,
How long will it take to resolve ticket OCRNET-614? After that, can we use the API without a license for a demo to our team, converting an image PDF into a searchable PDF in which all of the text from the image is searchable? That would help the team decide whether to buy a license. Please advise.
Thank you

asad.ali · November 22, 2022, 7:34pm

@hxb1851

We are currently investigating the issue, and it will be fixed on a first-come, first-served basis. We are afraid that you will not be able to use the API without a license, as you will face trial version limitations. However, you can request an extension to your existing temporary license in our Purchase forum in order to evaluate the API.

hxb1851 · November 28, 2022, 1:45pm

@asad.ali
Hi Asad.Ali,
I saw that the status of OCRNET-614 is resolved. When will the new version be released, or how can I get the code/package to test the fix on our case? At the least, I would like to run the code and demo it to the team so they can decide whether to buy a license. Please advise.
Thank you

asad.ali · November 28, 2022, 9:35pm

@hxb1851

Your ticket was not resolved in any new release. Instead, we investigated it and resolved it in the current version. Your image has very small text; unfortunately, such text is hard to recognize without mistakes. What we can advise is resizing the image before recognition. It takes more time, but the result will be better. Also use the TABLE or PHOTO option:

DocumentRecognitionSettings settings = new DocumentRecognitionSettings();
settings.DetectAreasMode = DetectAreasMode.PHOTO;
settings.PreprocessingFilters = new PreprocessingFilter()
{
    PreprocessingFilter.Scale(2, Aspose.OCR.Filters.InterpolationFilterType.Lanczos8)
};

result.zip (1.6 KB)

hxb1851 · November 29, 2022, 12:14am

@asad.ali
I tried the code you provided and re-tested, but nothing improved and the issue remains in our case. Some text is still missing, the saved searchable PDF cannot be searched for all of the text we need, and the positions of the search hits are incorrect. With so many defects, it looks like your OCR does not work at all.
Thank you

asad.ali · November 29, 2022, 9:23am

@hxb1851

Have you used the 22.11 version of the API to test the scenario? Please share the results you achieved at your end.

hxb1851 · November 29, 2022, 12:57pm

@asad.ali
I used the latest version, 22.11, from the NuGet package with a trial license. Attached is the output PDF file, which is not correctly searchable: the text positions are wrong and not all text is searchable. The screenshot shows that not all text was recognized. I cannot demo this to the team to request buying a license. Please advise.
Screenshot 2022-11-29 064711.zip (4.6 MB)

Thank you,

asad.ali · November 29, 2022, 8:24pm

@hxb1851

One more thing: have you tried to perform this operation using a valid 30-day free temporary license?

hxb1851 · November 29, 2022, 9:11pm

@asad.ali
I tried to request a 30-day free temporary license, but I could not get one. Could you please test on your side with our two scanned image PDFs? Both contain an SSN in two locations. If both SSN locations can be found by searching each output searchable PDF, please send me the results so that I can discuss with our team whether to buy a license for our product.
Attached is a zip of the two scanned PDFs. Please help.
69159903.zip (283.4 KB)

Thank you,