Hi Alessandro,
This is to inform you that I ran several tests aiming for 100% recognition accuracy on the sample you provided. Unfortunately, none of the approaches recognized all the text correctly, so I have logged the incident in our bug tracking system as ticket OCR-34059 for the product team's review. Please allow us some time to analyze this scenario properly; we will get back to you with updates.
Please note that I got the best accuracy by adding custom recognition blocks to the image. The process also took less time, because OcrEngine only has to process portions of the image rather than the whole image. Please see the following code snippet for reference. I also suggest storing the pre-processed images so that you can tune the OCR process yourself to get the best results.
C#
string text = "";
//Initialize an instance of OcrEngine
OcrEngine ocrEngine = new OcrEngine();
//Set Image property by loading an image from file path
ocrEngine.Image = ImageStream.FromFile("D:/samples/sample1.png");
//Clear recognition blocks
ocrEngine.Config.ClearRecognitionBlocks();
//Add 4 rectangle blocks to user defined recognition blocks
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(147, 290, 1365, 423));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(157, 785, 447, 113));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(597, 785, 393, 117));
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(1051, 785, 431, 111));
//Ignore everything else on the image other than the user defined recognition blocks
ocrEngine.Config.DetectTextRegions = false;
//Process the image
if (ocrEngine.Process())
{
text = text + ocrEngine.Text;
Console.WriteLine(ocrEngine.Text);
}
System.IO.File.WriteAllText("D:/output.txt", text);
Hi Babar,
Has there been any progress on this issue? I have an image whose processing never finishes, yet no error is raised.
It is almost as if the engine keeps finding an endless number of artifacts in the TIF to analyse.
In general the OCR works well, but some images do not process well.
The images were created from pages of a PDF, where each page is a paper page that a supplier scanned to PDF for us. I hope that makes sense.
Thanks
Ralph Price
Rotorua
New Zealand
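For an image that never finishes and never errors, one stopgap is to run the call under a time limit so that a batch job can move on to the next file. The following is only a minimal sketch in plain .NET: OcrTimeoutSketch and TryRunWithTimeout are hypothetical names, not part of Aspose.OCR, and the Thread.Sleep delegate is a stand-in for something like () => ocr.Process(). It also measures the elapsed time, which helps when comparing runs.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class OcrTimeoutSketch
{
    // Runs `work` on a thread-pool thread and waits at most `timeout`.
    // Returns true only if the work finished in time and itself returned true.
    // `elapsed` reports how long the caller actually waited.
    public static bool TryRunWithTimeout(Func<bool> work, TimeSpan timeout, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Task<bool> task = Task.Run(work);
        bool finished = task.Wait(timeout);
        elapsed = watch.Elapsed;
        // On timeout the task is NOT cancelled; it keeps running in the background.
        return finished && task.Result;
    }

    static void Main()
    {
        // Hypothetical stand-in for the slow OCR call, e.g. () => ocr.Process().
        Func<bool> slowWork = () => { Thread.Sleep(2000); return true; };

        TimeSpan elapsed;
        bool ok = TryRunWithTimeout(slowWork, TimeSpan.FromMilliseconds(200), out elapsed);
        Console.WriteLine("Finished in time: {0}, waited {1} ms", ok, (long)elapsed.TotalMilliseconds);
    }
}
```

With the 2-second stand-in and a 200 ms limit, the call reports that the work did not finish in time. Because the abandoned work keeps running, this only lets the caller skip a problem file; it does not free the resources the stuck call is using.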
Hi Ikram,
Here is a link to a zip file containing the PDF and the two TIF files extracted from it, which differ only in bit depth. Changing the bit depth did not improve the situation.
The TIF was extracted from the PDF using:
// Create Resolution object
Aspose.Pdf.Devices.Resolution resolution = new Aspose.Pdf.Devices.Resolution(150);
// Create TiffSettings object
Aspose.Pdf.Devices.TiffSettings tiffSettings = new TiffSettings();
tiffSettings.Compression = CompressionType.None;
tiffSettings.Depth = Aspose.Pdf.Devices.ColorDepth.Format1bpp;
tiffSettings.Shape = ShapeType.None;
tiffSettings.SkipBlankPages = false;
// Create TIFF device
TiffDevice tiffDevice = new TiffDevice(resolution, tiffSettings);
// Convert a particular page and save the image to stream
tiffDevice.Process(pdfDocument, 1, 1, theImgFilePath);
And the OCR was done with:
OcrEngine ocr = new OcrEngine();
ocr.Image = ImageStream.FromFile(theImgFilePath);
if (ocr.Process())
{
Console.WriteLine("Text recognized: " + ocr.Text);
File.WriteAllText(theTxtFilePath, ocr.Text.ToString());
}
Strangely, however, when I run the same code today it no longer gets stuck on this PDF and actually processes it at a reasonable speed. Other files that were taking 45 seconds to process are now down to 8 seconds.
I did update my version before posting on the forum, but it is almost as if the update only took effect today, even though the VS project was not running when I did the upgrade.
Feel free to have a look at the zip file anyway if you wish.
Thanks
Ralph Price
Hi Ikram,
Here is a current example that is a problem today.
Thanks
Ralph
ocrEngine.ClearNotifies();
ocrEngine.Config.ClearRecognitionBlocks();
ocrEngine.Config.AddRecognitionBlock(RecognitionBlock.CreateTextBlock(1311, 993, 481, 147));
ocrEngine.Config.DetectTextRegions = false;
ocrEngine.Image = ImageStream.FromFile(@"C:\OMR\sample_1.tif");
// Process the image; read the recognized parts only if processing succeeds
if (ocrEngine.Process())
{
    foreach (IRecognizedPartInfo info in ocrEngine.Text.PartsInfo)
    {
        IRecognizedTextPartInfo textInfo = (IRecognizedTextPartInfo)info;
        Console.WriteLine("Block: {0} Text: {1}", info.Box, textInfo.Text);
    }
}
Hi Ikram,
With sample 2, how long did the OCR take on the image file?
Was the TIF created using:
tiffDevice.Process(pdfDocument, iPage, iPage, theImgFilePath);
or
XImage xImage = pdfDocument.Pages[iPage].Resources.Images[iImage];
string theImgFilePath = txtBxDir.Text + @"\extracted" + i.ToString() + "_" + attachment.Name.Substring(0, attachment.Name.Length - 4) + "_" + iPage.ToString() + "_" + iImage.ToString() + ".jpg";
FileStream fs = File.Create(theImgFilePath);
xImage.Save(fs);
fs.Close();
Do you have a preference as to which method is better?
My assumption is that, since the page should be human-readable, extracting the raw image from the page (rather than rendering the whole page as an image) should make the first method unnecessary, at least as far as the compression of the image into the PDF page is concerned.
I have version 2.7.0 installed.
Thank you for the suggestion to set detection regions and for the code showing how to do so. Unfortunately, this process targets a number of .eml (email) files from which I want to extract text: both the 'raw' text in the email and the text in any attachments, whose layouts vary considerably. I am therefore unable to define target areas for OCR.
Thanks for your assistance.
Regards
Ralph Price
Aspose.Pdf.Devices.TiffSettings tiffSettings = new Aspose.Pdf.Devices.TiffSettings();
tiffSettings.Compression = Aspose.Pdf.Devices.CompressionType.None;
tiffSettings.Depth = Aspose.Pdf.Devices.ColorDepth.Format1bpp;
tiffSettings.Shape = Aspose.Pdf.Devices.ShapeType.None;
tiffSettings.SkipBlankPages = false;
Aspose.Pdf.Devices.TiffDevice tiffDevice = new Aspose.Pdf.Devices.TiffDevice(resolution, tiffSettings);
Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"C:\ocr_files\sample_2.pdf");
tiffDevice.Process(pdfDocument, 1, 1, @"C:\ocr_files\sample_2.tiff");
Good morning Ikram,
Thank you very much for your assistance.
Please feel free to have the Dev team look into these PDFs as examples.
Regards
Ralph Price
The issues you have found earlier (filed as ) have been fixed in this Aspose.Words for JasperReports 18.3 update.