DocumentRecognitionSettings DRS = new DocumentRecognitionSettings()
{
    PagesNumber = 0,
    PreprocessingFilters = filter, // "filter" must be defined earlier in your code
    //// allowed options
    AllowedCharacters = CharactersAllowedType.ALL, // ALL allows every character; choose a narrower type to ignore non-Latin symbols
    // AutoContrast = false, // use the contrast correction filter before recognition - good for images with noise
    AutoSkew = false, // switch off if your image is not rotated
    DetectAreas = true, // switch off if your image has a simple document structure (one-column text without pictures)
    DetectAreasMode = DetectAreasMode.COMBINE, // depends on the structure of your image
    IgnoredCharacters = "", // define the symbols you want to ignore in the recognition result
    Language = Language.Eng, // we support 26 languages
    //ThreadsCount = 15,
    LinesFiltration = true, // this is slow, so enable it only if your picture has lines and they are badly detected in the TABLE or DOCUMENT DetectAreasMode
    // ThreadsCount = 1, // by default our API uses all your threads, but you can run it in one thread by setting this
    // ThresholdValue = 150 // if you want to binarize the image with your own threshold value, set it here (from 1 to 255)
};
Would you please share the complete code snippet with your sample PDF document? We will test the scenario in our environment and address it accordingly.
We generated a searchable PDF in our environment using the code snippet below and observed the issue shown in the attached screenshot. Could you please confirm whether this is the same issue you are seeing?
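// C# Code
// Assumes "using System;" and an alias such as "using OCR = Aspose.OCR;"; dataDir is the folder holding the input file.
try
{
    var api = new OCR.AsposeOcr();
    var settings = new OCR.DocumentRecognitionSettings();
    settings.StartPage = 0;   // first page to process
    settings.PagesNumber = 6; // number of pages to process
    //settings.LinesFiltration = true;
    settings.DetectAreas = true;
    settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
    settings.ThreadsCount = 1;
    var res = api.RecognizePdf(dataDir + "A500008704_20220128_100605.tif_Searchable.pdf", settings);
    OCR.AsposeOcr.SaveMultipageDocument(dataDir + "File1_OCRd.pdf", OCR.SaveFormat.Pdf, res);
}
catch (Exception)
{
    // Rethrow with "throw;" rather than "throw ex;" so the original stack trace is preserved.
    throw;
}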
Yes, I am facing a similar issue. In my case too, the search highlight is on the same line but falls well before the actual word. See the attached sample for reference.
==> I am searching for the word “this”. It has 3 occurrences, but the search highlight appears a few letters before the word.
I also noticed that your code does image to PDF in one step, whereas I am doing image to PDF first and then PDF to searchable PDF.
Yes, the issue is the same on our end. It is caused by an incorrect font size for the text that the API places over the image inside the PDF. We are working to rectify this issue and improve the feature in the upcoming release, i.e. 22.9.
Aspose.OCR initially provided features for recognizing images and saving the results in different file formats like PDF, OCR, etc. Creating searchable PDFs was implemented later to perform OCR on scanned PDF documents. You can use whichever approach suits your requirements and use cases.
Thank you for the update. Also, when I uploaded the same image to the aspose.com Text to PDF Demo site today, the PDF rendered correctly. Glad to know.
Tentatively, when can we expect the updated NuGet package for 22.9?
Does the temporary license of Aspose.OCR have a limitation of scanning only 10 pages?
The help tooltip of RecognizePdf() in 22.8 says “Do not support searchable PDF”, so I got a little confused and used a two-step process to save the PDF.
When I used your one-step approach to save the image to PDF, the searchable PDF was created but its size was smaller. The quality was also degraded and images were cut off, so I will keep using the two-step save process.
We will be releasing version 22.9 of the API soon, before the end of this month, i.e. September 2022.
No, it is a limitation of the API itself: it can process a maximum of 10 pages of a PDF in one recognition call. We have limited this number to improve performance. You can recognize the entire document in parts of 10 pages in a loop.
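As a rough illustration of the 10-page loop described above, the helper below only computes the page windows; it does not call Aspose.OCR itself. PageBatcher and Batches are hypothetical names introduced for this sketch, and it assumes that StartPage is zero-based (as in the snippet earlier in this thread, which sets StartPage = 0) and that you already know the document's total page count from elsewhere.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.OCR): splits a page count into
// windows of at most batchSize pages, assuming zero-based page indexes.
public static class PageBatcher
{
    public static IEnumerable<(int StartPage, int PagesNumber)> Batches(int totalPages, int batchSize = 10)
    {
        if (totalPages < 0) throw new ArgumentOutOfRangeException(nameof(totalPages));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < totalPages; start += batchSize)
        {
            // The last window may be shorter than batchSize.
            yield return (start, Math.Min(batchSize, totalPages - start));
        }
    }
}

// Possible use with the settings object from this thread (untested assumption;
// totalPages and pdfPath are placeholders you must supply yourself):
//
// foreach (var (start, count) in PageBatcher.Batches(totalPages))
// {
//     settings.StartPage = start;
//     settings.PagesNumber = count;
//     var part = api.RecognizePdf(pdfPath, settings);
//     // collect "part" before saving the combined result
// }
```

For a 23-page document this yields three windows: pages 0 to 9, 10 to 19, and 20 to 22.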
In the tooltip, "searchable PDF" refers to a PDF that has mixed content, e.g. text and images. The API is designed and recommended for creating searchable PDFs from source PDFs that contain only scanned images.
Can you please share the two code snippets you are actually using, so that we can compare the results of both approaches? It would help us investigate the scenario accordingly.
@asad.ali
Also, is there any reason why RecognizeTiff() does not have an overload that takes a memory stream? We actually need one. Can you provide a memory stream overload for RecognizeTiff() too? It would be a great help.
The RecognizeImage() function has an overload that takes a memory stream, which is exactly what we want, but RecognitionSettings does not have the StartPage and PagesNumber properties needed to loop over all pages of a TIFF.
Can we also have the StartPage and PagesNumber properties on RecognitionSettings, as we have on DocumentRecognitionSettings?
Thanks
[quote=“asad.ali, post:11, topic:252342”]
We will surely look into the details of these tickets and let you know as soon as they are resolved. Please be patient and give us some time.
[/quote]
@asad.ali
Thank you so much for considering the need. I will be looking forward to it.
The ticket has been resolved and it will be included in the 22.10 version of the API. As soon as the fixed-in version is available, we will let you know.
The issues you have found earlier (filed as OCRNET-584) have been fixed in this update. This message was posted using Bugs notification tool by anna.pylaieva