Hi Team,
We need your help with the following requirements for OCR of documents, with samples:
- OCR on Scanned PDF
- OCR on Text PDF
- OCR on TIF files
- OCR on Image files (jpg, png, bmp, …)
- OCR on Office files (doc, xls, ppt)
Please check the following documentation article(s) to fulfil your requirements:
Please note that reading text-based PDF files is not a feature of Aspose.OCR; it is handled by Aspose.PDF. Furthermore, regarding the DOC/XLS/PPT files, do they also contain scanned documents on which you want to perform OCR, or do they contain only plain text?
Thanks for the details. Please help with the below as well.
You can use Aspose.PDF to find whether PDF contains images or text only. In the case of a text-only file, you can extract text from it as well using Aspose.PDF.
A feature request, OCRNET-475, has been logged in our issue tracking system to investigate the feasibility of this requirement. We will investigate whether Aspose.OCR can process these formats the same way it processes scanned PDFs, and we will let you know once we have an update.
In the meantime, you can use other Aspose APIs (Aspose.Words, Aspose.Cells and Aspose.Slides) to extract content (images/text) from DOCX, XLSX and PPTX files respectively. You can then use Aspose.OCR to perform OCR on the extracted images.
In case of any inquiry related to other Aspose APIs, we request you create topics in the respective forum categories where you will be assisted accordingly.
Hi Team,
We are facing issues while performing OCR on PDFs that contain only images and on PDFs that contain both images and text.
Tested sample files:
Only Image: Scan_0005.pdf
Both Image & Text: Both_Image_Text.pdf
Issue#1: PDF containing both image & text, 32-bit C# Windows application
The extracted OCR content is empty.
Issue#2: PDF containing both image & text, 64-bit C# Windows application
The extracted OCR content is empty.
Issue#3: PDF containing only an image, 32-bit C# Windows application
Creating the searchable PDF throws an exception:
Exception Message:
The type initializer for ' ’ threw an exception.
Exception StackTrace:
at .()
at .(RecognitionSettings )
at .(RecognitionSettings )
at Aspose.OCR.AsposeOcr.RecognizePdf(String fullPath, DocumentRecognitionSettings settings)
at OCR_Comparison.Aspose_OCR.AsposeScanned2Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 212
Issue#4: PDF containing only an image, 64-bit C# Windows application
The searchable PDF “Scan_0005_Aspose_Searchable.pdf” is created, but the following exception is thrown:
Exception Message:
Unable to cast object of type ‘#=z8fjHnK9lNvVlfUs59m9d5gO9igCvmWnvnQ==’ to type ‘#=zY5jratlnqgJ9det2qmRBM1zmgFdDkcR$FQ==’.
Exception StackTrace:
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zBtMiWjqbpiGA()
at Aspose.Pdf.Operators.SelectFont.#=zyx$Q8MY=(#=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operator…ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextOperator…ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.TextStateOperator…ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.Operators.SelectFont…ctor(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at #=zvieTOh0QECAwOxvDRRiNIBI9VH8RQu89qFsMQ4ng9qkoB3Ezfg==.#=zv72B_ME=(Int32 #=zJKxmitk=)
at Aspose.Pdf.Operator.#=zYL$eD2g=(Int32 #=zJKxmitk=, #=zyEnMnypX06BBL63w_ujJZh$jl8Fk2ba8jsulyG16hA0v #=zWCucLwI=)
at Aspose.Pdf.OperatorCollection.#=zqg7mGJrDGTXQ()
at Aspose.Pdf.OperatorCollection.#=zBDTdIOTq8nVw()
at Aspose.Pdf.OperatorCollection.get_Count()
at #=znwrRRAS4Loo9fctuiUfYkaLykl0eArjuSHmAuVOYY3QWM216Lzt$jcU=.#=zRQ_pTSY=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=, Page #=zcMGlH4U=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=zSoC4PSQmFAVV(BaseOperatorCollection #=zFgq23AE=, Resources #=zVOD9wLg=)
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS.#=z3I5EC7A=()
at #=zS9uC$025V4XeO$3YtdTDRduAQELNMIrn25GiFTG0oi9d9fkEzRj88cgl04VS…ctor(Page #=zcMGlH4U=, TextSearchOptions #=zZHZX9lmhi3uj, Boolean #=zXsO8JDGYhPdZ)
at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText(Encoding encoding)
at Aspose.Pdf.Facades.PdfExtractor.ExtractText()
at OCR_Comparison.Aspose_OCR.AsposeOcrPDF_Searchable(String InputFile) in C:\Venkat_Workspace\2022\POC\OCR_Comparison\OCR_Comparison\Aspose_OCR.cs:line 282
NOTE: When the PDF is opened in Acrobat Reader, it shows the alert message “An error exists on this page. Acrobat may not display the page correctly…”
Below are the code steps I’m following. Please review and suggest.
Checking whether the PDF contains image / text --> using Aspose.PDF
Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
PDFLicense.SetLicense(Licensepath);
MemoryStream ms = new MemoryStream();
PdfExtractor extractor = new PdfExtractor();
extractor.BindPdf(inputFile);
extractor.ExtractText();
extractor.GetText(ms);
bool containsText = ms.Length >= 1;
extractor.ExtractImage();
bool containsImage = extractor.HasNextImage();
if (containsText && !containsImage)
    return DocumentType.Text;
else if (!containsText && containsImage)
    return DocumentType.Image;
else if (containsText && containsImage)
    return DocumentType.Both;
else
    return DocumentType.None;
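The classification above reduces to a small decision table over two flags. Below is a minimal sketch of that mapping as a standalone function. It assumes a DocumentType enum with the four values used in the post; the enum declaration, the PdfContentClassifier class and the HasMeaningfulText helper are hypothetical names for illustration and are not part of any Aspose API. The helper shows one possible refinement: treating whitespace-only extractor output as "no text" rather than relying on the stream length alone.

```csharp
using System;

// Assumed shape of the DocumentType enum referenced in the post.
public enum DocumentType { None, Text, Image, Both }

public static class PdfContentClassifier
{
    // Maps the two probe results onto the four DocumentType values.
    public static DocumentType Classify(bool containsText, bool containsImage)
    {
        if (containsText && containsImage) return DocumentType.Both;
        if (containsText) return DocumentType.Text;
        if (containsImage) return DocumentType.Image;
        return DocumentType.None;
    }

    // Returns false for null, empty, or whitespace-only extractor output,
    // so that such output is not counted as real text.
    public static bool HasMeaningfulText(string extracted)
    {
        return !string.IsNullOrWhiteSpace(extracted);
    }
}
```

A length check such as ms.Length >= 1 counts any byte as text, including a lone line break, so decoding the stream to a string and passing it through a check like HasMeaningfulText may give a more reliable containsText flag.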
If it contains only an image --> creating a searchable PDF
Aspose.OCR.License OCRLicense = new Aspose.OCR.License();
OCRLicense.SetLicense(Licensepath);
ocr = new AsposeOcr();
DocumentRecognitionSettings set = new DocumentRecognitionSettings()
{
    StartPage = 0,
    PagesNumber = 1
};
List<RecognitionResult> result = ocr.RecognizePdf(InputFile, set);
string SearchablePDF = Path.GetDirectoryName(InputFile) + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose_Searchable" + Path.GetExtension(InputFile);
AsposeOcr.SaveMultipageDocument(SearchablePDF, SaveFormat.Pdf, result);
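The searchable PDF's file name is derived from the input path by string concatenation with a hard-coded separator. As a hedged alternative, the sketch below builds the same kind of name with System.IO.Path only; OutputPaths and WithSuffix are illustrative names, not part of the original code or of any Aspose library.

```csharp
using System;
using System.IO;

public static class OutputPaths
{
    // Returns "<dir>/<name><suffix><ext>" for the given input file,
    // letting Path.Combine choose the directory separator.
    public static string WithSuffix(string inputFile, string suffix)
    {
        // GetDirectoryName can return null (e.g. for a root path); fall back to "".
        string directory = Path.GetDirectoryName(inputFile) ?? string.Empty;
        string fileName = Path.GetFileNameWithoutExtension(inputFile)
                          + suffix
                          + Path.GetExtension(inputFile);
        return Path.Combine(directory, fileName);
    }
}
```

For example, OutputPaths.WithSuffix(InputFile, "_Aspose_Searchable") could stand in for the concatenation used for SearchablePDF.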
If the PDF contains only text, only an image (the searchable PDF from the 2nd step), or both image & text --> extracting the text using Aspose.PDF
Aspose.Pdf.License PDFLicense = new Aspose.Pdf.License();
PDFLicense.SetLicense(Licensepath);
MemoryStream ms = new MemoryStream();
pdfExtractor = new PdfExtractor();
pdfExtractor.BindPdf(InputFile);
pdfExtractor.ExtractTextMode = 1;
// ExtractText
pdfExtractor.ExtractText();
string TXTFIlesFolderPath = AppDomain.CurrentDomain.BaseDirectory + "TxtFiles";
if (!Directory.Exists(TXTFIlesFolderPath))
    Directory.CreateDirectory(TXTFIlesFolderPath);
string sTxtFile = TXTFIlesFolderPath + "\\" + Path.GetFileNameWithoutExtension(InputFile) + "_Aspose.txt";
pdfExtractor.GetText(sTxtFile);
pdfExtractor.Close();
if (File.Exists(sTxtFile))
{
    Message = File.ReadAllText(sTxtFile);
    Message = Message.Replace("\0", "").Replace("\r\n", "");
}
Both_Image_Text.pdf (473.1 KB)
Scan_0005.pdf (443.0 KB)
Scan_0005_Aspose_Searchable.pdf (1.5 MB)
We tested the scenario using Aspose.OCR for .NET 22.3 and did not observe the exceptions you mentioned in either case. However, we did notice that the extracted text content was empty. Could you please test again with version 22.3 and let us know whether the exception still occurs on your side? We will then log the issues in our issue management system and share the IDs with you.
We are currently using Aspose.OCR 22.1 and Aspose.PDF 22.1.
If possible, please try with these versions. We will try with 22.3.
Also, please confirm whether the process we are following is correct:
STEP#1: Checking whether the PDF contains image / text --> using Aspose.PDF
STEP#2: If it contains only an image --> creating a searchable PDF --> using Aspose.OCR
STEP#3: If the PDF contains only text, only an image (the searchable PDF from the 2nd step), or both image & text --> extracting the text using Aspose.PDF
With Aspose.OCR 22.3 we no longer get the alert when opening the PDF.
However, the content in the PDF is not searchable,
and the extracted content is empty.
Please add this to our paid support.
Please consider this and address it as soon as possible. Thanks.
Regarding the steps above, we need to investigate further whether there is a better approach to achieve the expected output. The issue of empty text being returned also needs to be investigated. Please note that we recommend posting this inquiry in the paid support forum if you already have a paid support subscription. You can log in to the helpdesk using the same email address that was used to purchase the subscription. This way your issue will be logged and addressed on a priority basis.