Creating searchable PDFs (OCR)

betovillalobos · April 21, 2021, 7:07pm

Thank you, it worked!

BSchwab · August 4, 2021, 12:51pm

Hello, what is the current status of this topic? Is there a timeline for the feature?

asad.ali · August 4, 2021, 5:48pm

There are three different tickets linked to this thread. Could you please specify which one you are inquiring about? We will share our feedback with you accordingly.

BSchwab · August 11, 2021, 6:38am

We don't want to use tesseract.exe. It would be nice if the “create searchable PDF” feature were included in Aspose.PDF or Aspose.OCR. I guess it's PDFNET-46139.

Other APIs have this feature (creating good OCR for PDF files without using the external tesseract.exe). We are waiting for this feature in Aspose…

asad.ali · August 11, 2021, 6:24pm

@BSchwab

We definitely intend to provide this feature. However, we are not certain when it will be available, as it is quite a complex feature and requires new components to be included in the API. We have recorded your concerns and will inform you once we make significant progress towards resolving the issue. Please give us some time.

BSchwab · September 8, 2023, 7:47am

Two years later … it's still open?

asad.ali · September 8, 2023, 4:51pm

@BSchwab

We sincerely apologize for the delay in resolving your issue and the inconvenience it has caused you. We understand your frustration and we appreciate your patience and loyalty.

We want to assure you that your issue is important to us and we are working hard to find a solution as soon as possible. We have also escalated your issue to the next level of priority. We will inform you as soon as we have definite updates about the resolution of the tickets. We again apologize for the inconvenience.

BSchwab · January 26, 2024, 10:27am

@asad.ali

Hello,
I would like to ask again what the current status of the “Create searchable PDFs” feature is.

I found the following “advertisement” on the Aspose website - it sounds like the feature is already available and working: Aspose.OCR Scanned PDF to text for .NET | Aspose

I have tested the code, but the PDF that is created looks broken, although the text was recognized well. I could not get a satisfactory result with any of my (very simple) test PDFs.

For example:
Non OCR source PDF (created in Word): no_ocr_word.pdf (31.0 KB)

Aspose Result: result.pdf (121.3 KB)

asad.ali · January 26, 2024, 6:02pm

@BSchwab

This particular feature has always been challenging because PDF documents vary widely in their internal structure. It works successfully with many PDF documents, but there is always a chance that it will not produce the expected result, because a PDF can have a different structure and arrangement of elements.

Nevertheless, we also noticed the issue with Aspose.OCR for .NET in our environment and have logged a ticket as OCRNET-785 in our issue tracking system to rectify it. We will inform you once the investigation is complete and we have some feedback to share with you. We apologize for the inconvenience caused.

alexiseayy · January 29, 2024, 5:18pm

Thanks for the approach, but I'm sorry, that makes no sense to me.

asad.ali · January 29, 2024, 7:29pm

@alexiseayy

If you have further concerns, please feel free to share them. We will consider them and work on enhancing the API's OCR capabilities.

asad.ali · February 1, 2024, 8:31am

@alexiseayy

We investigated your PDF and noticed that it contains two images per page. Our library recognizes images, so it creates a PDF with one image per page, which looks different from the original PDF with its two images per page. We will improve the PDF creation algorithm to combine images on a page.

asad.ali · February 8, 2024, 8:16pm

@BSchwab

Regarding OCRNET-785, it is also possible to convert the pages of this PDF to images and then recognize them as images.

Currently this is only available with an Aspose.PDF license.

Example code:

using System.Collections.Generic;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

string pdfPath = @"no_ocr_word.pdf";

List<Aspose.OCR.RecognitionResult> ocrResults = new List<RecognitionResult>();
AsposeOcr api = new AsposeOcr();

// Optionally render the pages at a higher resolution:
// Resolution resolution = new Resolution(300);
// PngDevice imageDevice = new PngDevice(resolution);
PngDevice imageDevice = new PngDevice();
Document pdfDocument = new Document(pdfPath);

// Page indices in Aspose.PDF start at 1
for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
{
    // The using block disposes the stream, so no explicit Close() is needed
    using (MemoryStream ms = new MemoryStream())
    {
        // Render the current page as a PNG image into the stream
        imageDevice.Process(pdfDocument.Pages[pageNumber], ms);

        // Recognize the rendered page image
        OcrInput input = new OcrInput(InputType.SingleImage);
        input.Add(ms);
        var recognResult = api.Recognize(input, new RecognitionSettings { DetectAreasMode = DetectAreasMode.TABLE });
        ocrResults.Add(recognResult[0]);
    }
}

// Save the recognition results of all pages as a single PDF
AsposeOcr.SaveMultipageDocument("res.pdf", Aspose.OCR.SaveFormat.Pdf, ocrResults);

res.pdf (141.8 KB)

The resulting PDF is attached.
In the next release we will add this feature to the Aspose.OCR library as well.

BSchwab · February 9, 2024, 8:13am

@asad.ali

Hello,
thanks for the example. I now have a paid-support ticket for this topic.

Unfortunately, the PDF result does not meet our expectations.
The original PDF should not be replaced by a PDF made of images; only an invisible text layer containing the OCR text should be added.
In addition, the recognized text is not very accurate. For example, “valantic” was recognized as “volantic”.

asad.ali · February 9, 2024, 1:13pm

@BSchwab

Thanks for the feedback. We have updated the ticket information as per your comments and will keep investigating it.

asad.ali · April 2, 2024, 8:50pm

@BSchwab

The ticket OCRNET-785 has been fixed in the latest version of the API. Please feel free to create a new topic in case you need any kind of assistance.

BSchwab · April 3, 2024, 7:31am

OK, the layout of the resulting PDF now looks good, but the recognized OCR text is still not usable.
I assume this is still in progress?

asad.ali · April 3, 2024, 3:59pm

@BSchwab

Could you please share the generated output PDF for our reference? We will then investigate further.

BSchwab · April 5, 2024, 6:22am

Code: Aspose.OCR’s Scanned PDF to Text Plugin | Extract Text from PDFs
Source: no_ocr_word.pdf (31.0 KB)
Result: result.pdf (98.5 KB)

asad.ali · April 5, 2024, 2:26pm

@BSchwab

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-822

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.