Hi there!
We use Aspose.OCR in conjunction with Aspose.Pdf to convert a TIF file into a searchable PDF. Attached you will find an example of the TIF files.
We are a bit disappointed with the OCR quality, especially compared to the OCR quality of (the free) Tesseract. Attached you will find two PDFs, one created by Aspose and the other by Tesseract, so that you can assess the differences firsthand.
An indication of how we use Aspose is as follows.
For the attached examples, we’ve used the German language model.
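Settings = new RecognitionSettings();
Settings.Language = Aspose.OCR.Language.Deu;

public MemoryStream RecognizeImage(MemoryStream image)
{
    var pdf = new MemoryStream();
    // Recognize the TIF image and save the result as a searchable PDF.
    var result = engine.RecognizeImage(image, Settings);
    result.Save(pdf, SaveFormat.Pdf);
    pdf.Position = 0; // rewind so the caller can read the PDF from the start
    return pdf;
}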
Can you say something about the differences? Do you utilize Tesseract under the hood yourselves?
Input and output.zip (2.4 MB)
@frimbingpickering
We have opened the following investigation ticket in our issue management system to analyze this case further.
Issue ID(s): OCRNET-664
We will look into the details and keep you posted on the status of the ticket. Please be patient and give us some time.
We are sorry for the inconvenience.
@asad.ali, thank you. Any news yet? Can you give us an indication of when we will receive a response?
@frimbingpickering
Unfortunately, the ticket has not been investigated yet. Tickets are handled on a first-come, first-served basis, and as soon as we make progress towards a resolution, we will update you in this forum thread. Please give us some time.
We apologize for the inconvenience.
Okay, understood, we’ll wait and see.
Maybe while you investigate this, you can already answer the following question:
Do you use Tesseract under the hood? Or do you use an engine of your own? And if so, is it based on Tesseract?
Thank you!
@frimbingpickering
No, we don’t use Tesseract in any way. We use our own neural network. Aspose.OCR Cloud has been using models of our own development, and we use them in the downloadable versions too. We use these models for all of the features below:
- text recognition
- detection of areas containing text
- dewarping curved text and removing noise
- Cyrillic, Latin, Hindi, and Chinese models
@frimbingpickering
Would you please try to use this mode:
// Assumes api is an AsposeOcr instance and imgPath is the path to the TIF file.
OcrInput input = new OcrInput(InputType.TIFF);
input.Add(imgPath);
var result = api.Recognize(input, new RecognitionSettings
{
    DetectAreasMode = DetectAreasMode.PHOTO
});
Console.WriteLine(result[0].RecognitionText);
result[0].Save("res664.pdf", SaveFormat.Pdf);
The results: compare.zip (923.8 KB)
Sure, that looks better!
But why? According to the docs, DetectAreasMode.PHOTO is “better for image with a lot of pictures and other not text objects”, and I don’t think this example image is such an image.
How do I know which DetectAreasMode to choose? How can I advise my customers on this?
@frimbingpickering
We have recorded your feedback under the logged ticket and will let you know once we have investigated it. Please give us a little time.
@frimbingpickering
For images with tables and lines, we advise using PHOTO or TABLE mode. For distorted text lines (if the lines of text curl), we advise using CURVED_TEXT.
DOCUMENT mode is helpful only with structured text such as scanned contracts, book pages, etc.
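As a rough sketch of the advice above (not an official Aspose sample), the choice could be captured in a small helper. The ImageKind enum and the ModeAdvisor class are hypothetical names introduced only for illustration. The helper returns the name of the suggested DetectAreasMode member as a string, so the snippet compiles without the Aspose.OCR package.

```csharp
using System;

// Hypothetical image categories, for illustration only; not part of Aspose.OCR.
public enum ImageKind
{
    StructuredDocument, // scanned contracts, book pages
    TablesOrLines,      // images with tables and lines
    CurvedText          // lines of text that curl
}

public static class ModeAdvisor
{
    // Returns the name of the DetectAreasMode member suggested in this thread.
    public static string SuggestModeName(ImageKind kind)
    {
        switch (kind)
        {
            case ImageKind.TablesOrLines:
                return "PHOTO"; // TABLE is the other mode suggested for this case
            case ImageKind.CurvedText:
                return "CURVED_TEXT";
            case ImageKind.StructuredDocument:
                return "DOCUMENT";
            default:
                throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }
}
```

For example, ModeAdvisor.SuggestModeName(ImageKind.CurvedText) yields "CURVED_TEXT", which corresponds to DetectAreasMode.CURVED_TEXT in RecognitionSettings. Since the mapping is only a starting point, it is still worth comparing the output of two modes on a representative sample image.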