Creating searchable PDFs (OCR)

alexiseayy · January 29, 2024, 5:18pm

Thanks for the approach, but … I'm sorry, that makes no sense to me.

asad.ali · January 29, 2024, 7:29pm

In case you have further concerns, please feel free to share. We will surely consider them and work on enhancing the API capabilities to perform OCR operations.

asad.ali · February 1, 2024, 8:31am

@alexiseayy

We investigated your PDF and noticed that it contains two images per page. Our library recognizes images, so it creates a PDF with one image per page. As a result, the output looks different from the original PDF, which has two images per page. We will improve the PDF creation algorithm to combine images on a page.

asad.ali · February 8, 2024, 8:16pm

@BSchwab

Regarding OCRNET-785, it is also possible to convert the pages of this PDF to images and then recognize them as images.

Currently, this is available only with an Aspose.PDF license.

Example code:

using System.Collections.Generic;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf;
using Aspose.Pdf.Devices;

string pdfPath = @"no_ocr_word.pdf";

var ocrResults = new List<RecognitionResult>();
AsposeOcr api = new AsposeOcr();

// Optionally render pages at a specific resolution instead of the default:
// Resolution resolution = new Resolution(300);
// PngDevice imageDevice = new PngDevice(resolution);
PngDevice imageDevice = new PngDevice();
Document pdfDocument = new Document(pdfPath);

// Pages in Aspose.PDF are indexed starting at 1
for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
{
    using (MemoryStream ms = new MemoryStream())
    {
        // Render the current page as a PNG image into the stream
        imageDevice.Process(pdfDocument.Pages[pageNumber], ms);
        // Rewind the stream before handing it to the OCR engine
        ms.Position = 0;

        // Recognize the rendered page as a single image
        OcrInput input = new OcrInput(InputType.SingleImage);
        input.Add(ms);
        var recognResult = api.Recognize(input, new RecognitionSettings { DetectAreasMode = DetectAreasMode.TABLE });
        ocrResults.Add(recognResult[0]);
    }
}

// Combine the per-page results into one PDF
AsposeOcr.SaveMultipageDocument("res.pdf", Aspose.OCR.SaveFormat.Pdf, ocrResults);

res.pdf (141.8 KB)

The resulting PDF is attached.
In the next release, we will add this feature to the Aspose.OCR library as well.

BSchwab · February 9, 2024, 8:13am

@asad.ali

Hello,
thanks for the example. I now have a paid-support ticket for this topic.

Unfortunately, the PDF result does not meet our expectations.
The PDF should not be replaced by a PDF with images; only an invisible text layer containing the OCR text should be added.
In addition, the recognized text is not very good. For example, “valantic” was recognized as “volantic”.

asad.ali · February 9, 2024, 1:13pm

@BSchwab

Thanks for the feedback. We have updated the ticket information as per your comments and will keep investigating it.

asad.ali · April 2, 2024, 8:50pm

@BSchwab

The ticket OCRNET-785 has been fixed in the latest version of the API. Please feel free to create a new topic in case you need any kind of assistance.

BSchwab · April 3, 2024, 7:31am

OK, the layout of the resulting PDF now looks good, but the recognized OCR text is still not usable.
I assume this is still in progress?

asad.ali · April 3, 2024, 3:59pm

@BSchwab

Can you please share the generated output PDF for our reference? We will proceed further accordingly.

BSchwab · April 5, 2024, 6:22am

Coding: Aspose.OCR’s Scanned PDF to Text Plugin | Extract Text from PDFs
Source: no_ocr_word.pdf (31.0 KB)
Result: result.pdf (98.5 KB)

asad.ali · April 5, 2024, 2:26pm

@BSchwab

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-822

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.