OCR operation is slow

itconsult.developer · May 15, 2015, 2:18am

Hello, I'm also testing the OCR library.

I have noticed long processing times as well (I used a 300 dpi TIFF).

My question is: does the trial version intentionally add delays during processing?

Thank you,

babar.raza · May 15, 2015, 3:37am

Hi Alessandro,

Thank you for using Aspose APIs.

I don’t think the delay is caused by evaluation limitations; you can confirm this by requesting a 30-day temporary license. The delay in processing could be due to the large dimensions of the input image and/or the complexity of the text blocks in the image. We can thoroughly investigate the matter on our end if you provide your sample along with your environment details. Please run a simple test and record the time OcrEngine spends in the Process method. We will use those values as a benchmark while performing similar tests on our side.
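
A minimal sketch of such a timing test, using the standard System.Diagnostics.Stopwatch class. The helper name OcrTiming.Measure is hypothetical (it is not part of Aspose.OCR), and the operation passed in stands in for a call to Process on an OcrEngine that already has its image set:

```csharp
using System;
using System.Diagnostics;

public static class OcrTiming
{
    // Runs 'operation' once and returns the elapsed wall-clock time.
    // 'operation' is a placeholder for the OCR call, for example
    // () => ocrEngine.Process(); 'succeeded' receives its return value.
    public static TimeSpan Measure(Func<bool> operation, out bool succeeded)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        succeeded = operation();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

It could be called as OcrTiming.Measure(() => ocrEngine.Process(), out ok), with the resulting TimeSpan reported alongside the image dimensions and machine details.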

Please note, we have split the existing thread to create a new one on your behalf so we could treat your request individually.

itconsult.developer · May 18, 2015, 4:23am

Hi Babar,

Thank you.

We will buy the full license, because we are interested in other libraries.

Anyway, I created a test program and tried it with some images, without good results…

Please see the attached sample images and project.

Regards

babar.raza · May 18, 2015, 11:50am

Hi Alessandro,

Thank you for sharing the samples.

I have evaluated the presented scenario using the latest version, Aspose.OCR for .NET 2.5.0, and was able to observe the performance lag. Please note, the sample in PNG format proved to be the most efficient; however, it still took almost 7 minutes to complete the OCR operation on my machine, which has a first-generation Core i7 with 6 GB of RAM. I have logged this incident in our bug tracking system under the ticket OCR-34056 for further investigation.

Regarding the OCR results, I am unable to get 100% accuracy, but the results are better, with a few spelling mistakes and one unrecognizable text block. I will try to tweak the process to get better results and share my findings here.

itconsult.developer · May 19, 2015, 1:47am

Thank you.

I will wait for the news.

Please note that we bought the license for Aspose.Total for .NET yesterday.

Regards.

babar.raza · May 19, 2015, 12:14pm

Hi Alessandro,

It is good to know that you’re on board with us now. Please note, the ticket logged earlier as OCR-34056 is currently pending analysis. I have requested the product team to schedule it for analysis at the earliest possible time. As soon as we have completed the preliminary investigation, we will share the results here along with a possible schedule for the fix.

babar.raza · May 20, 2015, 5:03am

Hi Alessandro,

This is to inform you that I have performed a few tests to get 100% recognition accuracy from your provided sample. Unfortunately, none of the solutions allowed me to recognize all the text correctly; therefore, I have logged the incident in our bug tracking system under the ticket OCR-34059 for the product team’s review. Please allow us some time to properly analyze this scenario and get back to you with updates in this regard.

Please note, I was able to get maximum accuracy by adding custom recognition blocks on the image. Moreover, the process took less time to complete because OcrEngine does not have to process the complete image, only portions of it. Please check the following code snippet for your reference. I also suggest that you use the mechanism for storing the pre-processed images so that you can tweak the OCR process yourself to get the best results.

C#

string text = "";
// Initialize an instance of OcrEngine
OcrEngine ocrEngine = new OcrEngine();

// Set the Image property by loading an image from a file path
ocrEngine.Image = ImageStream.FromFile("D:/samples/sample1.png");
// Clear recognition blocks
ocrEngine.Config.ClearRecognitionBlocks();

// Add 4 rectangular blocks to the user-defined recognition blocks
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(147, 290, 1365, 423));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(157, 785, 447, 113));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(597, 785, 393, 117));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(1051, 785, 431, 111));

// Ignore everything on the image other than the user-defined recognition blocks
ocrEngine.Config.DetectTextRegions = false;

// Process the image
if (ocrEngine.Process())
{
    text = text + ocrEngine.Text;
    Console.WriteLine(ocrEngine.Text);
}
System.IO.File.WriteAllText("D:/output.txt", text);

itconsult.developer · May 20, 2015, 10:03am

Hi Babar,

thank you for your support.

My goal is to find the text “Riservato”, placed into the top right corner.

Using a single little block like the following I can reach my goal: the text is found.

But the process takes about 25 seconds, which seems too long to me.

ocrEngine.Config.ClearRecognitionBlocks()

ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(1360, 85, 150, 60))

ocrEngine.Config.DetectTextRegions = False

Thank you anyway.

Regards.

babar.raza · May 21, 2015, 12:51am

Hi Alessandro,

We have already logged your performance-related concerns under the ticket OCR-34056. Please let us analyze it first; if appropriate, we will log a separate ticket for your recent concerns. I also believe that once the aforesaid ticket is resolved, the overall performance of the API will improve, and you will be able to get results quickly with custom recognition blocks as well.

Anyway, we will discuss the recent concerns with the product team, and keep you posted with updates in this regard.

itconsult.developer · May 21, 2015, 1:48am

Thank you Babar,

Regards.

GIS_Analyst · August 2, 2015, 10:54pm

Hi Babar
Was there any progress on this issue? I have an image that does not want to finish at all, but it also does not raise an error.
It is almost as if it is finding an infinite number of artifacts in the TIFF to analyse.

In general the OCR works well but there are some images that do not process so well.
The images have been created from pages of a pdf where the page is a paper page scanned to a pdf file by a supplier to us. Hope that makes sense.

Thanks

Ralph Price
Rotorua
New Zealand
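
For an image that never finishes and never raises an error, one way to at least detect the hang is to run the call on a worker task and wait with a timeout. The following is a minimal sketch using only System.Threading.Tasks. The name OcrWatchdog.TryRunWithTimeout is hypothetical, the operation stands in for the OCR call, and it is an assumption that OcrEngine can safely be used from a worker thread:

```csharp
using System;
using System.Threading.Tasks;

public static class OcrWatchdog
{
    // Returns true if 'operation' finished within 'timeout'; 'result' then
    // holds its return value. Returns false on timeout. The operation is
    // NOT cancelled on timeout and keeps running in the background.
    // If the operation throws, Wait rethrows it as an AggregateException.
    public static bool TryRunWithTimeout(Func<bool> operation, TimeSpan timeout, out bool result)
    {
        Task<bool> task = Task.Run(operation);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = false;
        return false;
    }
}
```

A batch job could log the file name when this returns false and move on to the next image, so that one problem file does not block the whole run.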

ikram.haq · August 3, 2015, 4:01am

Hi Ralph,

Thank you for your inquiry.

Yes, our product team is making progress in this regard. They are working on improving the Text Recognition Block algorithm and on reducing the time taken by the OCR engine. For your current issue, please forward us the image that you have extracted and the source PDF file so that we can test it at our end and come up with an appropriate solution.

Hope the above information helps. In case of any issues, or if you need further clarification, please be sure to let us know; we will be glad to assist you.

GIS_Analyst · August 3, 2015, 3:36pm

Hi Ikram
here is a link to a zip file containing the pdf and the two extracted tif files where the tif files differ in their bit depth. Changing the bit depth did not improve the situation

Tif extracted from pdf using:

// Create Resolution object
Aspose.Pdf.Devices.Resolution resolution = new Aspose.Pdf.Devices.Resolution(150);

// Create TiffSettings object
Aspose.Pdf.Devices.TiffSettings tiffSettings = new TiffSettings();
tiffSettings.Compression = CompressionType.None;
tiffSettings.Depth = Aspose.Pdf.Devices.ColorDepth.Format1bpp;
tiffSettings.Shape = ShapeType.None;
tiffSettings.SkipBlankPages = false;

// Create TIFF device
TiffDevice tiffDevice = new TiffDevice(resolution, tiffSettings);

// Convert a particular page and save the image to stream
tiffDevice.Process(pdfDocument, 1, 1, theImgFilePath);

And OCR done with:

OcrEngine ocr = new OcrEngine();
ocr.Image = ImageStream.FromFile(theImgFilePath);
if (ocr.Process())
{
Console.WriteLine("Text recognized: " + ocr.Text);
File.WriteAllText(theTxtFilePath, ocr.Text.ToString());
}

However strangely today when I run the same code it is not getting stuck on this pdf and is actually processing it at a reasonable speed!! Others that were taking 45 seconds to process are now down to 8 seconds.
I did update my version prior to writing on the forum but it is almost as if the effect of that has only taken effect today despite not having the VS project running when I did the upgrade.

Feel free to have a look at the zip file if you wish any way.

Thanks

Ralph Price

GIS_Analyst · August 3, 2015, 8:43pm

Hi Ikram

here is a current example that is a problem today.

Thanks

Ralph

ikram.haq · August 4, 2015, 10:48am

Hi Ralph,

Thank you for writing us back along with sample files.

We have evaluated the presented scenario at our end using the samples you provided. We extracted the images from the PDFs at 300 dpi each and then performed OCR. It generates acceptable results, although the execution time is a bit high. The PDFs, extracted images and text output have been attached to this post for your reference.

Further, we have used the latest version, Aspose.OCR for .NET 2.7.0. If you intend to get specific contents from a portion of the image, you can use custom recognition blocks to get better accuracy. Please note, this solution is useful in scenarios where your documents follow a similar structure, that is, the contents to be scanned are always at the same location in each image.

Following is the sample code for your reference:

OcrEngine ocrEngine = new OcrEngine();
ocrEngine.ClearNotifies();

ocrEngine.Config.ClearRecognitionBlocks();
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(1311, 993, 481, 147));
ocrEngine.Config.DetectTextRegions = false;
ocrEngine.Image = ImageStream.FromFile(@"C:\OMR\sample_1.tif");

if (ocrEngine.Process())
{
    foreach (IRecognizedPartInfo info in ocrEngine.Text.PartsInfo)
    {
        IRecognizedTextPartInfo textInfo = (IRecognizedTextPartInfo)info;
        Console.WriteLine("Block: {0} Text: {1}", info.Box, textInfo.Text);
    }
}

Output:

File Name : Sample_1.tif

Subrotal 478,26

GST 71.74

Ameunt 550.90
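
Note on block coordinates: the values passed to CreateTextBlock in this thread appear to be pixel positions in the order x, y, width, height (an assumption based on the snippets, not confirmed here). If so, a block measured on a page rendered at one resolution needs to be rescaled before it is used on the same page rendered at another, for example 300 dpi versus the 150 dpi used earlier in the thread. A minimal sketch of that arithmetic, with hypothetical helper names:

```csharp
using System;

public static class BlockScaling
{
    // Scales one pixel coordinate measured at 'fromDpi' to its equivalent
    // at 'toDpi', rounding halves away from zero.
    public static int Scale(int value, int fromDpi, int toDpi)
    {
        return (int)Math.Round(value * (double)toDpi / fromDpi, MidpointRounding.AwayFromZero);
    }

    // Scales an (x, y, width, height) block and returns the four values
    // in the same order.
    public static int[] ScaleBlock(int x, int y, int width, int height, int fromDpi, int toDpi)
    {
        return new[]
        {
            Scale(x, fromDpi, toDpi),
            Scale(y, fromDpi, toDpi),
            Scale(width, fromDpi, toDpi),
            Scale(height, fromDpi, toDpi)
        };
    }
}
```

The four returned values could then be passed to CreateTextBlock in place of the original coordinates.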

Hope the above information helps. In case of any issues, or if you need further clarification, please be sure to let us know; we will be glad to assist you.

GIS_Analyst · August 4, 2015, 4:36pm

Hi Ikram

with sample 2, what time was taken to do the OCR on the image file?

Was the tif created using:

tiffDevice.Process(pdfDocument, iPage, iPage, theImgFilePath);

or

XImage xImage = pdfDocument.Pages[iPage].Resources.Images[iImage];
string theImgFilePath = txtBxDir.Text + @"\extracted" + i.ToString() + "_" + attachment.Name.Substring(0, attachment.Name.Length - 4) + "_" + iPage.ToString() + "_" + iImage.ToString() + ".jpg";
FileStream fs = File.Create(theImgFilePath);
xImage.Save(fs);
fs.Close();

Do you have a preference as to which method is better?
I am assuming that, since the page should be human readable, getting the raw image from the page (rather than rendering the whole page as an image) means the first method should not be necessary, at least as far as compression of the image into the PDF page is concerned.

It is version 2.7.0 that I have installed.

Thank you for the suggestion for setting detection regions and the code for how to do so. Unfortunately this process is targeting a number of .eml (email files) that I want to extract the text from, both ‘raw’ text in the e-mail as well as text in any attachments (with considerably varying layouts) to the emails so I am unable to set any target areas to OCR.

Thanks for your assistance.

Regards

Ralph Price

ikram.haq · August 5, 2015, 8:01am

Hi Ralph,

Thank you for writing us back.

1. It took barely 12 seconds to perform OCR on sample 2, i.e. sample_2.tiff. The TIFF file was created using the code below:

Aspose.Pdf.Devices.Resolution resolution = new Aspose.Pdf.Devices.Resolution(300);
Aspose.Pdf.Devices.TiffSettings tiffSettings = new Aspose.Pdf.Devices.TiffSettings();
tiffSettings.Compression = Aspose.Pdf.Devices.CompressionType.None;
tiffSettings.Depth = Aspose.Pdf.Devices.ColorDepth.Format1bpp;
tiffSettings.Shape = Aspose.Pdf.Devices.ShapeType.None;
tiffSettings.SkipBlankPages = false;
Aspose.Pdf.Devices.TiffDevice tiffDevice = new Aspose.Pdf.Devices.TiffDevice(resolution, tiffSettings);
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"C:\ocr_files\sample_2.pdf");
tiffDevice.Process(pdfDocument, 1, 1, @"C:\ocr_files\sample_2.tiff");

2. The two code snippets serve different purposes. Code 1, using TiffDevice, converts a PDF page to a TIFF image, whereas Code 2, using XImage, only extracts images if they exist in the PDF. In this particular case we are trying to convert a PDF document to a TIFF image, so go for Code 1.

Further, I am adding the samples you provided to the ticket OCR-34059 so that our product team can also look into them.

Hope the above information helps. In case of any issues, or if you need further clarification, please be sure to let us know; we will be glad to assist you.

GIS_Analyst · August 6, 2015, 3:00pm

Good morning Ikram

thank you very much for your assistance.
Please feel free to have the Dev team look into these PDFs as examples.

Regards

Ralph Price

ikram.haq · August 11, 2015, 12:40am

Hi Ralph,

The PDFs and the generated TIFFs have been forwarded to the product team along with detailed information. Our product team will look into the details and we will keep you updated on the status.

In the meantime, if you face any issues, please be sure to let us know; we will be glad to assist you.

awais.hafeez · March 29, 2018, 5:23am

The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.