Aspose.OCR for .NET on different file types

Hi Team,
We need your help with the following OCR requirements, with samples:

  1. OCR on Scanned PDF
  2. OCR on Text PDF
  3. OCR on TIF files
  4. OCR on Image files (jpg, png, bmp, …)
  5. OCR on Office files (doc, xls, ppt)

@venkatmallu

Please check the following documentation article(s) to fulfil your requirements:

Please note that reading text PDF files is not a feature of Aspose.OCR; it is done by Aspose.PDF. Furthermore, regarding the DOC/XLS/PPT files, do they also contain scanned documents on which you want to perform OCR, or do they contain only plain text?

Thanks for the details. Please help with the following as well:

  1. How can we identify whether a PDF is a scanned PDF or a text PDF, so that we know when to use Aspose.PDF?
  2. A sample link showing how to extract the text from a text PDF with Aspose.PDF
  3. DOCX, XLSX, and PPTX files may contain only text, only images, or both text and images. Are there any options to extract the content for these combinations with Aspose?

@venkatmallu

You can use Aspose.PDF to find whether PDF contains images or text only. In the case of a text-only file, you can extract text from it as well using Aspose.PDF.
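As a rough illustration of such a check, the sketch below walks the pages of a PDF with the Aspose.PDF `Document` API and records whether any page has a text layer or image resources. The names `PdfContent`, `PdfContentProbe`, `Classify`, and `Inspect` are hypothetical and not part of Aspose.PDF, and license setup is omitted. It assumes that a non-blank `TextAbsorber` result means the page has text and that images are listed directly in the page resources; images nested inside form XObjects would not be counted this way.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical result type for this sketch (not an Aspose type).
public enum PdfContent { None, Text, Image, Both }

public static class PdfContentProbe
{
    // Pure decision logic: combine the two findings into one result.
    public static PdfContent Classify(bool hasText, bool hasImage) =>
        hasText && hasImage ? PdfContent.Both
        : hasText ? PdfContent.Text
        : hasImage ? PdfContent.Image
        : PdfContent.None;

    // Inspect every page for a text layer and for image resources.
    public static PdfContent Inspect(string pdfPath)
    {
        bool hasText = false;
        bool hasImage = false;

        using (var document = new Document(pdfPath))
        {
            foreach (Page page in document.Pages)
            {
                // Extract this page's text; whitespace-only output counts as "no text".
                var absorber = new TextAbsorber();
                page.Accept(absorber);
                if (!string.IsNullOrWhiteSpace(absorber.Text))
                    hasText = true;

                // Image XObjects referenced directly by this page.
                if (page.Resources.Images.Count > 0)
                    hasImage = true;
            }
        }

        return Classify(hasText, hasImage);
    }
}
```

A scanned page that already carries an OCR text layer would be reported as `Both`, so the result is only a hint about which API to apply next.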

A feature request, OCRNET-475, has been logged in our issue tracking system to investigate the feasibility of this requirement. We will investigate whether it is possible to add a feature to Aspose.OCR that processes these formats just as it does scanned PDFs. We will let you know once we have an update.

In the meantime, you can use other Aspose APIs such as Aspose.Words, Aspose.Cells, and Aspose.Slides to extract content (images/text) from DOCX, XLSX, and PPTX files respectively. You can then use Aspose.OCR to perform OCR on the extracted images.
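For a DOCX file, that two-step approach might look roughly like the sketch below. It is only an outline under some assumptions: the class `DocxOcrSketch` and method `ExtractAllText` are hypothetical names, license setup is omitted, and the OCR call uses the `RecognizeImage(path, RecognitionSettings)` overload of the Aspose.OCR 22.x API discussed in this thread (later releases changed the recognition API). Metafile images such as EMF/WMF may need converting to a raster format before OCR.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Aspose.Words;
using Aspose.Words.Drawing;

public static class DocxOcrSketch
{
    // Hypothetical helper: returns the document's own text followed by
    // the OCR text of each embedded image.
    public static string ExtractAllText(string docxPath, string tempFolder)
    {
        var doc = new Document(docxPath);

        // Step 1: the text that is already stored as text in the document.
        var parts = new List<string> { doc.ToString(SaveFormat.Text) };

        // Aspose.OCR types are fully qualified to avoid clashes with Aspose.Words.
        var ocr = new Aspose.OCR.AsposeOcr();
        var settings = new Aspose.OCR.RecognitionSettings();
        Directory.CreateDirectory(tempFolder);

        // Step 2: save each embedded image to disk and run OCR on it.
        int index = 0;
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            if (!shape.HasImage)
                continue;

            string ext = FileFormatUtil.ImageTypeToExtension(shape.ImageData.ImageType);
            string imagePath = Path.Combine(tempFolder, $"image_{index++}{ext}");
            shape.ImageData.Save(imagePath);

            Aspose.OCR.RecognitionResult result = ocr.RecognizeImage(imagePath, settings);
            parts.Add(result.RecognitionText);
        }

        return string.Join(Environment.NewLine, parts);
    }
}
```

The same pattern would apply to XLSX and PPTX with the image collections exposed by Aspose.Cells and Aspose.Slides.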

In case of any inquiry related to other Aspose APIs, we request you create topics in the respective forum categories where you will be assisted accordingly.

Hi Team,
We are facing issues while performing OCR on PDFs that contain only images and on PDFs that contain both images and text.
Tested sample files:
Only Image: Scan_0005.pdf
Both Image & Text: Both_Image_Text.pdf

Issue #1: PDF with both image and text, 32-bit C# Windows application:
The extracted OCR content is empty.

Issue #2: PDF with both image and text, 64-bit C# Windows application:
The extracted OCR content is empty.

Issue #3: PDF with only an image, 32-bit C# Windows application:
Creating the searchable PDF throws an exception:
Exception Message:
The type initializer for ' ’ threw an exception.

Exception StackTrace:
at .()
at .(RecognitionSettings )
at .(RecognitionSettings )
at Aspose.OCR.AsposeOcr.RecognizePdf(String fullPath, DocumentRecognitionSettings settings)
at OCR_Comparison.Aspose_OCR.AsposeScanned2Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 212

Issue #4: PDF with only an image, 64-bit C# Windows application:
The searchable PDF “Scan_0005_Aspose_Searchable.pdf” is created, but the following exception is thrown:
Exception Message:
Unable to cast object of type ‘#=z8fjHnK9lNvVlfUs59m9d5gO9igCvmWnvnQ==’ to type ‘#=zY5jratlnqgJ9det2qmRBM1zmgFdDkcR$FQ==’.

Exception StackTrace:
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zBtMiWjqbpiGA()
at Aspose.Pdf.Operators.SelectFont.#=zyx$Q8MY=(#=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextOperator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextStateOperator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.SelectFont..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zv72B_ME=(Int32 #=zJKxmitk=)
at Aspose.Pdf.Operator.#=zYL$eD2g=(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.OperatorCollection.#=zqg7mGJrDGTXQ()
at Aspose.Pdf.OperatorCollection.#=zBDTdIOTq8nVw()
at Aspose.Pdf.OperatorCollection.get_Count()
at #=znwrRRAS4Loo9fctuiUfYkaLykl0eArjuSHmAuVOYY3QWM216Lzt$jcU=.#=zRQ_pTSY=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=, Page #=zcMGlH4U=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=z3I5EC7A=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS..ctor(Page #=zcMGlH4U=, TextSearchOptions #=zZHZX9lmhi3uj, Boolean #=zXsO8JDGYhPdZ)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()
at OCR_Comparison.Aspose_OCR.AsposeOcrPDF_Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 282

NOTE: When the PDF is opened in Acrobat Reader, it shows the alert message “An error exists on this page. Acrobat may not display the page correctly…”

Below are the code steps I’m following. Please review and suggest:

  1. Check whether the PDF contains images and/or text, using Aspose.PDF:
    Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
    PDFLicense.SetLicense(Licensepath);
    MemoryStream ms = new MemoryStream();
    PdfExtractor extractor = new PdfExtractor();
    extractor.BindPdf(inputFile);
    // Extract any text layer into the memory stream
    extractor.ExtractText();
    extractor.GetText(ms);
    bool containsText = ms.Length >= 1;
    // Check whether at least one image can be extracted
    extractor.ExtractImage();
    bool containsImage = extractor.HasNextImage();
    if (containsText && !containsImage)
        return DocumentType.Text;
    else if (!containsText && containsImage)
        return DocumentType.Image;
    else if (containsText && containsImage)
        return DocumentType.Both;
    else
        return DocumentType.None;

  2. If it contains only images, create a searchable PDF using Aspose.OCR:

    Aspose.OCR.License OCRLicense = new Aspose.OCR.License();
    OCRLicense.SetLicense(Licensepath);

    ocr = new AsposeOcr();
    // Recognize only the first page
    DocumentRecognitionSettings set = new DocumentRecognitionSettings()
    {
        StartPage = 0,
        PagesNumber = 1
    };
    List<RecognitionResult> result = ocr.RecognizePdf(InputFile, set);

    string SearchablePDF = Path.GetDirectoryName(InputFile) + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose_Searchable" + Path.GetExtension(InputFile);
    AsposeOcr.SaveMultipageDocument(SearchablePDF, SaveFormat.Pdf, result);

  3. If the PDF contains only text, only images (the searchable PDF from step 2), or both images and text, extract the text using Aspose.PDF:

    Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
    PDFLicense.SetLicense(Licensepath);

    MemoryStream ms = new MemoryStream();
    pdfExtractor = new PdfExtractor();

    pdfExtractor.BindPdf(InputFile);

    // 1 selects raw ordering mode (0 is pure text mode)
    pdfExtractor.ExtractTextMode = 1;
    pdfExtractor.ExtractText();

    string TXTFIlesFolderPath = AppDomain.CurrentDomain.BaseDirectory + "TxtFiles";

    if (!Directory.Exists(TXTFIlesFolderPath))
        Directory.CreateDirectory(TXTFIlesFolderPath);
    string sTxtFile = TXTFIlesFolderPath + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose.txt";

    // Write the extracted text to a file, then read it back
    pdfExtractor.GetText(sTxtFile);
    pdfExtractor.Close();
    if (File.Exists(sTxtFile))
    {
        Message = File.ReadAllText(sTxtFile);
        // Strip null characters and line breaks
        Message = Message.Replace("\0", "").Replace("\r\n", "");
    }

Both_Image_Text.pdf (473.1 KB)
Scan_0005.pdf (443.0 KB)
Scan_0005_Aspose_Searchable.pdf (1.5 MB)

@venkatmallu

We tested the scenario using Aspose.OCR for .NET 22.3 and did not observe the exceptions you mentioned in either case. However, we did notice that the extracted text content was empty. Could you please test again with version 22.3 and let us know whether the exception still occurs on your side? We will log the issues in our issue management system accordingly and share the IDs with you.

We are currently using Aspose.Ocr 22.1 & Aspose.Pdf 22.1.

If possible, please test with these versions. We will test with 22.3.

Also, please confirm that the process we are following is correct:

STEP #1: Check whether the PDF contains images and/or text --> using Aspose.PDF

STEP #2: If it contains only images, create a searchable PDF --> using Aspose.OCR

STEP #3: If the PDF contains only text, only images (the searchable PDF from step 2), or both images and text, extract the text --> using Aspose.PDF

With Aspose.OCR 22.3 we no longer get the alert when opening the PDF.
However, the content in the PDF is not searchable, and the extracted content is empty.
Please add this to our paid support.
Please consider this and address it as soon as possible. Thanks.

@venkatmallu

For the steps above, we need to investigate further whether there is a better approach to achieve the expected output. The issue of empty text being returned also needs to be investigated. Please note that we recommend posting this inquiry in the paid support forum if you already have a paid support subscription. You can log in to the helpdesk using the same email address that was used to purchase the subscription. This way your issue will be logged and addressed on a priority basis.