Questions about OCR

AStelzner · June 27, 2023, 10:24am

Hi,

We are evaluating Aspose.OCR as a replacement for another third-party tool, but we have some questions:

1.) When I set OnlyText.pdf as OcrInput, the result PDF is completely empty. Why?
2.) When I set Screenshot.pdf as OcrInput, the result PDF contains no OCR data to select. Why?

We use only PDFs as OCR input, but these PDFs are the result of converting Word, Excel, or other formats to PDF. Therefore, we must be sure that use cases 1.) and 2.) work correctly.

Kind Regards,
Andy

asad.ali · June 27, 2023, 8:12pm

@AStelzner

Please note that Aspose.OCR only processes scanned PDF documents. PDFs that contain text cannot be processed using Aspose.OCR; you need to use Aspose.PDF to extract text from them.

Furthermore, we were able to notice that Screenshot.pdf was not correctly processed by Aspose.OCR. Therefore, we have logged an issue as OCRNET-696 in our issue tracking system for further analysis. We will look into its details and keep you posted with the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

PS: We removed the solution from your first post as it contained your license file. Please do not include your license file in such sample solutions while sharing.

AStelzner · June 28, 2023, 5:45am

Thanks a lot for the information and for the hint concerning the license.

Am I getting this right: a PDF that contains text on any page can never be turned into a searchable PDF using Aspose.OCR?

Kind regards,
Andy

asad.ali · June 28, 2023, 5:05pm

@AStelzner

Yes, your understanding is correct. However, you can convert such PDF pages (with mixed content) into images using Aspose.PDF and then perform OCR on them using Aspose.OCR.

AStelzner · June 30, 2023, 11:27am

Hello,

OK, this is very tricky. So I have to:

1.) Check which pages contain text
2.) Extract those pages to an image file
3.) Remember the page number
4.) Replace the pages with text by the extracted images
5.) Save the new PDF
6.) Perform OCR

Thanks, but no

Or is there an easy way to replace PDF pages with an image version?

If not, I have to search for another API provider.

Thanks,
Andy

asad.ali · July 1, 2023, 3:52pm

@AStelzner

We are gathering information in order to meet your requirements using minimal steps. We will get back to you shortly.

asad.ali · July 18, 2023, 10:23pm

@AStelzner

In your scenario, extracting the pages and replacing them would not be ideal. Converting every page into an image and recognizing it is not really suitable either, although it can be an easy way. We will investigate the feasibility of recognizing PDFs with mixed content, but we need some time for it. We will log a ticket and share its ID with you soon.

Furthermore, the issue about dark background is resolved now.

You can use code like the following:

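using System.Collections.Generic;
using System.IO;
using Aspose.OCR;
using Aspose.Pdf.Devices;

// inputPdf (path to the source PDF) and Api_OcrProgress (your own progress
// handler) are expected to be defined elsewhere.
var asposeOcr = new Aspose.OCR.AsposeOcr();

// Subscribe once, outside the loop, so the handler is not attached once per page
asposeOcr.OcrProgress += Api_OcrProgress;

// Render each page as a PNG at a resolution of 300
Resolution resolution = new Resolution(300);
PngDevice imageDevice = new PngDevice(resolution);
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(inputPdf);

List<RecognitionResult> ocrResults = new List<RecognitionResult>();

// Page numbers in Aspose.PDF start at 1
for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
{
    using (MemoryStream ms = new MemoryStream())
    {
        // Convert a particular page and save the image to stream
        imageDevice.Process(pdfDocument.Pages[pageNumber], ms);

        var ocrInput = new OcrInput(InputType.SingleImage)
        {
            ms
        };

        // Recognize the rendered page as German text
        List<RecognitionResult> recognitionResults = asposeOcr.Recognize(ocrInput, new RecognitionSettings
        {
            Language = Language.Deu,
            DetectAreasMode = DetectAreasMode.PHOTO
        });

        ocrResults.AddRange(recognitionResults);
    }
}

// Write all page results into one PDF (SaveFormat.PdfNoImg, as in the original sample)
Aspose.OCR.AsposeOcr.SaveMultipageDocument($".\\{Path.GetFileName(inputPdf)}_out.pdf", Aspose.OCR.SaveFormat.PdfNoImg, ocrResults);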

It’s a combination of Aspose.OCR and Aspose.PDF: use Aspose.PDF to convert the pages into images, and then recognize the images.

Also, you can use Aspose.PDF to extract text. Note that Aspose.OCR currently cannot recognize white text on a dark background. We will add this ability later.
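
The text extraction mentioned above could also cover the "check which pages contain text" step from earlier in the thread. What follows is only a minimal sketch, not code from the Aspose team: it assumes Aspose.PDF for .NET is referenced, and the class name TextLayerCheck and method name PagesWithoutText are hypothetical names chosen for illustration. It runs a TextAbsorber over each page and reports the pages for which no text comes back, which would then be the candidates for rendering to images and OCR.

```csharp
using System.Collections.Generic;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper, for illustration only
public static class TextLayerCheck
{
    // Returns the 1-based numbers of pages from which no text could be extracted
    public static List<int> PagesWithoutText(string inputPdf)
    {
        var pagesNeedingOcr = new List<int>();

        using (var pdfDocument = new Document(inputPdf))
        {
            // Page numbers in Aspose.PDF start at 1
            for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
            {
                // A fresh absorber per page, so Text holds only this page's text
                var absorber = new TextAbsorber();
                pdfDocument.Pages[pageNumber].Accept(absorber);

                if (string.IsNullOrWhiteSpace(absorber.Text))
                {
                    pagesNeedingOcr.Add(pageNumber);
                }
            }
        }

        return pagesNeedingOcr;
    }
}
```

This check is coarse: a page that mixes real text with a screenshot would count as "has text" and be skipped, so for mixed pages it may still be necessary to render the page to an image and recognize it as shown in the code above.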

asad.ali · July 20, 2023, 11:45am

@AStelzner

We will be logging and sharing the ticket ID with you soon on this matter.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRNET-701

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.