Aspose.OCR for .NET on different files

venkatmallu · February 8, 2022, 11:21pm

Hi Team,
We need your help, with samples, on the following requirements for OCR of documents:

OCR on Scanned PDF
OCR on Text PDF
OCR on TIF files
OCR on Image files (jpg, png, bmp, …)
OCR on Office files (doc, xls, ppt)

asad.ali · February 9, 2022, 3:04pm

Please check the following documentation article(s) to fulfil your requirements:

Recognize scanned PDF

Please note that reading text PDF files is not a feature of Aspose.OCR; it is done by Aspose.PDF. Furthermore, regarding the DOC/XLS/PPT files, do they also contain scanned documents on which you want to perform OCR, or do they contain only plain text?

venkatmallu · February 9, 2022, 3:54pm

Thanks for the details. Please help with the following as well:

How can we identify whether a PDF is a scanned PDF or a text PDF, so that we know when to use Aspose.PDF?
A sample link showing how to extract the content of a text PDF with Aspose.PDF
DOCX, XLSX and PPTX files may contain only text, only images, or both text and images --> are there any options to extract the content of all these combinations with Aspose?

asad.ali · February 9, 2022, 9:13pm

@venkatmallu

You can use Aspose.PDF to find whether PDF contains images or text only. In the case of a text-only file, you can extract text from it as well using Aspose.PDF.
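
As a rough illustration of that check, here is a minimal sketch. It assumes Aspose.PDF for .NET is referenced; `PdfContentKind`, `PdfClassifier`, `Classify` and `Inspect` are hypothetical names used only for this example, not part of any Aspose API. Note that `page.Resources.Images` lists the images referenced directly by a page, so images nested inside other objects may be missed.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical result type for this sketch.
public enum PdfContentKind { None, Text, Image, Both }

public static class PdfClassifier
{
    // Pure decision step, kept separate so it can be checked without a PDF file.
    public static PdfContentKind Classify(bool hasText, bool hasImage)
    {
        if (hasText && hasImage) return PdfContentKind.Both;
        if (hasText) return PdfContentKind.Text;
        if (hasImage) return PdfContentKind.Image;
        return PdfContentKind.None;
    }

    public static PdfContentKind Inspect(string pdfPath)
    {
        using (var document = new Document(pdfPath))
        {
            // Collect the text of all pages; whitespace-only output counts as "no text".
            var absorber = new TextAbsorber();
            document.Pages.Accept(absorber);
            bool hasText = !string.IsNullOrWhiteSpace(absorber.Text);

            // Look for at least one image in the resources of any page.
            bool hasImage = false;
            foreach (Page page in document.Pages)
            {
                if (page.Resources.Images.Count > 0)
                {
                    hasImage = true;
                    break;
                }
            }

            return Classify(hasText, hasImage);
        }
    }
}
```

A call such as `PdfClassifier.Inspect(inputFile)` could then decide whether the file goes to Aspose.OCR first or straight to text extraction.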

A feature request, OCRNET-475, has been logged in our issue tracking system to investigate the feasibility of this requirement. We will check whether Aspose.OCR can process these formats the way it processes scanned PDFs, and we will let you know once we have an update.

In the meantime, you can use other Aspose APIs such as Aspose.Words, Aspose.Cells and Aspose.Slides to extract content (images/text) from DOCX, XLSX and PPTX files respectively. You can then use Aspose.OCR to perform OCR on the extracted images.
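
For the DOCX case, the two stages could look roughly like the sketch below. It assumes Aspose.Words for .NET and an Aspose.OCR for .NET 22.x release in which `AsposeOcr.RecognizeImage(string)` returns the recognized text; `DocxImageOcr`, `BuildImagePath` and `RecognizeEmbeddedImages` are hypothetical names. Embedded images in formats that Aspose.OCR cannot read (for example metafiles) would need to be skipped or converted first, which this sketch does not handle.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.Words;
using Aspose.Words.Drawing;

public static class DocxImageOcr
{
    // Hypothetical helper: builds the path of the temporary image file.
    public static string BuildImagePath(string folder, int index, string extension)
    {
        return Path.Combine(folder, "image_" + index + extension);
    }

    // Saves every embedded image of a DOCX to workFolder and runs OCR on it.
    public static List<string> RecognizeEmbeddedImages(string docxPath, string workFolder)
    {
        Directory.CreateDirectory(workFolder);

        var document = new Document(docxPath);
        var ocr = new Aspose.OCR.AsposeOcr();
        var texts = new List<string>();
        int index = 0;

        foreach (Shape shape in document.GetChildNodes(NodeType.Shape, true))
        {
            if (!shape.HasImage) continue;

            // ImageTypeToExtension returns the extension including the leading dot.
            string extension = FileFormatUtil.ImageTypeToExtension(shape.ImageData.ImageType);
            string imagePath = BuildImagePath(workFolder, index++, extension);

            shape.ImageData.Save(imagePath);
            texts.Add(ocr.RecognizeImage(imagePath));
        }

        return texts;
    }
}
```

The same pattern (extract images with the format-specific API, then pass each image to Aspose.OCR) would apply to XLSX and PPTX with Aspose.Cells and Aspose.Slides, using their own image collections.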

In case of any inquiry related to other Aspose APIs, we request you create topics in the respective forum categories where you will be assisted accordingly.

venkatmallu · April 15, 2022, 2:49pm

Hi Team,
While performing OCR on PDFs that contain only an image, or both image and text, we are facing issues.
Tested sample files:
Only Image: Scan_0005.pdf
Both Image & Text: Both_Image_Text.pdf

Issue#1: PDF containing both image and text, 32-bit C# Windows application
The extracted OCR content is empty.

Issue#2: PDF containing both image and text, 64-bit C# Windows application
The extracted OCR content is empty.

Issue#3: PDF containing only an image, 32-bit C# Windows application
While creating the searchable PDF, we get an exception:
Exception Message:
The type initializer for ' ’ threw an exception.

Exception StackTrace:
at .()
at .(RecognitionSettings )
at .(RecognitionSettings )
at Aspose.OCR.AsposeOcr.RecognizePdf(String fullPath, DocumentRecognitionSettings settings)
at OCR_Comparison.Aspose_OCR.AsposeScanned2Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 212

Issue#4: PDF containing only an image, 64-bit C# Windows application
The searchable PDF “Scan_0005_Aspose_Searchable.pdf” is created, but extracting its text throws the exception below:
Exception Message:
Unable to cast object of type ‘#=z8fjHnK9lNvVlfUs59m9d5gO9igCvmWnvnQ==’ to type ‘#=zY5jratlnqgJ9det2qmRBM1zmgFdDkcR$FQ==’.

Exception StackTrace:
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zBtMiWjqbpiGA()
at Aspose.Pdf.Operators.SelectFont.#=zyx$Q8MY=(#=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextOperator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextStateOperator..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.SelectFont..ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zv72B_ME=(Int32 #=zJKxmitk=)
at Aspose.Pdf.Operator.#=zYL$eD2g=(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.OperatorCollection.#=zqg7mGJrDGTXQ()
at Aspose.Pdf.OperatorCollection.#=zBDTdIOTq8nVw()
at Aspose.Pdf.OperatorCollection.get_Count()
at #=znwrRRAS4Loo9fctuiUfYkaLykl0eArjuSHmAuVOYY3QWM216Lzt$jcU=.#=zRQ_pTSY=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=, Page #=zcMGlH4U=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=z3I5EC7A=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS..ctor(Page #=zcMGlH4U=, TextSearchOptions #=zZHZX9lmhi3uj, Boolean #=zXsO8JDGYhPdZ)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()
at OCR_Comparison.Aspose_OCR.AsposeOcrPDF_Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 282

NOTE: When the PDF is opened in Acrobat Reader, it shows the alert message “An error exists on this page. Acrobat may not display the page correctly…”

Below are the code steps I’m following. Please review and suggest.

STEP#1: Checking whether the PDF contains image / text --> using Aspose.PDF

    Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
    PDFLicense.SetLicense(Licensepath);

    MemoryStream ms = new MemoryStream();
    PdfExtractor extractor = new PdfExtractor();
    extractor.BindPdf(inputFile);

    // Extract the text and check whether anything was written to the stream
    extractor.ExtractText();
    extractor.GetText(ms);
    bool containsText = ms.Length >= 1;

    // Extract the images and check whether at least one was found
    extractor.ExtractImage();
    bool containsImage = extractor.HasNextImage();

    if (containsText && !containsImage)
        return DocumentType.Text;
    else if (!containsText && containsImage)
        return DocumentType.Image;
    else if (containsText && containsImage)
        return DocumentType.Both;
    else
        return DocumentType.None;
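
One caveat with the `ms.Length >= 1` check: a stream that holds only a byte-order mark, NUL characters or whitespace would still count as text. A stricter test could look like the minimal sketch below, assuming the extracted text has already been decoded into a string; `ExtractedText` and `HasMeaningfulText` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class ExtractedText
{
    // True only if the text has at least one character that is not
    // whitespace, a NUL character, or a byte-order mark.
    public static bool HasMeaningfulText(string text)
    {
        if (string.IsNullOrEmpty(text)) return false;
        return text.Any(c => c != '\0' && c != '\uFEFF' && !char.IsWhiteSpace(c));
    }
}
```

With such a helper, `containsText` could be derived from the decoded text instead of the raw stream length.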
STEP#2: If it contains only an image --> creating a searchable PDF using Aspose.OCR

    Aspose.OCR.License OCRLicense = new Aspose.OCR.License();
    OCRLicense.SetLicense(Licensepath);

    ocr = new AsposeOcr();

    // Recognize only the first page of the scanned PDF
    DocumentRecognitionSettings set = new DocumentRecognitionSettings()
    {
        StartPage = 0,
        PagesNumber = 1
    };
    List<RecognitionResult> result = ocr.RecognizePdf(InputFile, set);

    // Save the result next to the input file as <name>_Aspose_Searchable.pdf
    string SearchablePDF = Path.GetDirectoryName(InputFile) + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose_Searchable" + Path.GetExtension(InputFile);
    AsposeOcr.SaveMultipageDocument(SearchablePDF, SaveFormat.Pdf, result);
STEP#3: If the PDF contains only text (or) image (searchable PDF from the 2nd step) (or) both image & text --> using Aspose.PDF

    Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
    PDFLicense.SetLicense(Licensepath);

    MemoryStream ms = new MemoryStream();
    pdfExtractor = new PdfExtractor();

    pdfExtractor.BindPdf(InputFile);

    pdfExtractor.ExtractTextMode = 1;
    // ExtractText
    pdfExtractor.ExtractText();

    string TXTFIlesFolderPath = AppDomain.CurrentDomain.BaseDirectory + "TxtFiles";

    if (!Directory.Exists(TXTFIlesFolderPath))
        Directory.CreateDirectory(TXTFIlesFolderPath);
    string sTxtFile = TXTFIlesFolderPath + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose.txt";

    // Write the extracted text to a file, then read it back and clean it up
    pdfExtractor.GetText(sTxtFile);
    pdfExtractor.Close();
    if (File.Exists(sTxtFile))
    {
        Message = File.ReadAllText(sTxtFile);
        Message = Message.Replace("\0", "").Replace("\r\n", "");
    }

Both_Image_Text.pdf (473.1 KB)
Scan_0005.pdf (443.0 KB)
Scan_0005_Aspose_Searchable.pdf (1.5 MB)

asad.ali · April 15, 2022, 8:13pm

@venkatmallu

We tested the scenario using Aspose.OCR for .NET 22.3 and did not see the exceptions you mentioned in either case. However, we did notice that the extracted text content was empty. Can you please test again with version 22.3 and let us know whether the exceptions still occur on your side? We will log the issues accordingly in our issue management system and share the IDs with you.

venkatmallu · April 15, 2022, 8:57pm

We are currently using Aspose.OCR 22.1 and Aspose.PDF 22.1.

If possible, please try with these versions. We will try with 22.3.

Also, please confirm that the process we are following is correct:

STEP#1: Checking the pdf is having image / Text --> using Aspose.PDF

STEP#2: If it is having only Image --> creating searchable pdf --> using Aspose.Ocr

STEP#3: If the pdf is having only Text (or) Image (searchable pdf from 2nd step) (or) both image & text --> using Aspose.PDF

venkatmallu · April 15, 2022, 9:15pm

With Aspose.OCR 22.3 we no longer get the alert when opening the PDF.
But the content in the PDF is not searchable,
and the extracted content is empty.
Please add this to our paid support.
Please consider this and address it as soon as possible. Thanks.

asad.ali · April 15, 2022, 9:46pm

@venkatmallu

Regarding the steps above, we need to investigate further whether there is a better approach to achieve the expected output. The issue of empty text being returned also needs to be investigated. Please note that we recommend posting this inquiry in the paid support forum if you already have a paid support subscription. You can log in to the helpdesk using the same email address that was used to purchase the subscription. This way your issue will be logged and addressed on a priority basis.