In a searchable PDF, the search highlight is not on the exact word; it lags behind

Gpatil · September 21, 2022, 7:03pm

Below are my settings.

DocumentRecognitionSettings DRS = new DocumentRecognitionSettings()
{
    PagesNumber = 0,

    PreprocessingFilters = filter,
    // allowed options
    AllowedCharacters = CharactersAllowedType.ALL, // ignore non-Latin symbols
    // AutoContrast = false, // use contrast correction filter before recognition - good for images with noise
    AutoSkew = false, // switch off if your image is not rotated

    DetectAreas = true, // switch off if your image has a simple document structure (one-column text without pictures)

    DetectAreasMode = DetectAreasMode.COMBINE, // depends on the structure of your image

    IgnoredCharacters = "", // define the symbols you want to ignore in the recognition result

    Language = Language.Eng, // we support 26 languages
    // ThreadsCount = 15,
    LinesFiltration = true, // this works slowly, so choose it only if your picture has lines and they are badly detected in TABLE or DOCUMENT DetectAreasMode
    // ThreadsCount = 1, // by default our API uses all your threads, but you can run it in one thread by setting this
    // ThresholdValue = 150 // if you want to binarize the image with your own threshold value, set it here (from 1 to 255)
};

Can you tell me what I am missing?

asad.ali · September 22, 2022, 4:21am

@Gpatil

Would you please share the complete code snippet with your sample PDF document? We will test the scenario in our environment and address it accordingly.

Gpatil · September 22, 2022, 8:36pm

Code&Files.zip (524.4 KB)

Hi Asad, please find the files attached.

You may need to call Aspose_OCR_Test.ASPOSEOCRTESTING(Filepath).

I have used Aspose.OCR 22.8.0 (NuGet), and the application targets .NET Core 3.1.

I have also attached my sample searchable PDF

Thank you so much @asad.ali !

asad.ali · September 23, 2022, 12:33pm

@Gpatil

We generated a searchable PDF in our environment using the code snippet below and observed the issue shown in the attached screenshot. Could you please confirm whether this is the same issue you are noticing?

// C# Code
// Assumes the alias: using OCR = Aspose.OCR;
try
{
    var api = new OCR.AsposeOcr();

    var settings = new OCR.DocumentRecognitionSettings();
    settings.StartPage = 0;
    settings.PagesNumber = 6;
    //settings.LinesFiltration = true;
    settings.DetectAreas = true;
    settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
    settings.ThreadsCount = 1;

    var res = api.RecognizePdf(dataDir + "A500008704_20220128_100605.tif_Searchable.pdf", settings);

    OCR.AsposeOcr.SaveMultipageDocument(dataDir + "File1_OCRd.pdf", OCR.SaveFormat.Pdf, res);
}
catch (Exception)
{
    throw; // rethrow without resetting the stack trace
}

image.png (146.0 KB)

Gpatil · September 24, 2022, 2:02pm

@asad.ali

Yes, I am facing a similar issue. In my case too, the search highlight is on the same line but well behind the exact word. See the attached sample for reference.
==> I am searching for the word “this”. It has 3 occurrences, but the highlight is shown a few letters back.

Also, I noticed that your code goes from image to PDF in one step, whereas I convert the image to PDF and then the PDF to a searchable PDF.

Which is the correct way to create a searchable PDF?

Offset_shift_search.jpg (95.1 KB)

asad.ali · September 24, 2022, 7:28pm

@Gpatil

Yes, we see the same issue on our end. It is caused by an incorrect font size for the text that the API places over the image inside the PDF. We are trying to rectify this issue and improve the feature in the upcoming release, i.e. 22.9.

Aspose.OCR initially provided features for recognizing images and saving the results in different file formats like PDF, OCR, etc. Creating searchable PDFs was implemented later, to perform OCR on scanned PDF documents. You can use whichever way suits your requirements and use cases.

Gpatil · September 25, 2022, 1:02pm

@asad.ali

Thank you for the update. Also, when I uploaded the same image to the aspose.com Text to PDF demo site today, the PDF rendered correctly. Glad to know.

Tentatively, when can we expect the updated NuGet package, 22.9?
Does the temporary license of Aspose.OCR have a limitation of scanning only 10 pages?
The help tooltip of RecognizePdf() in 22.8 says “Do not support searchable PDF”, so I got a little confused and used a two-step process to save the PDF.
When I used your one-step way to save the image to PDF, the searchable PDF was created but its size was smaller. Also, the quality was degraded and the images were cut off, so I will be using the two-step save process.

I have attached 3 samples for the same:
BrainScan.1StepWay.Searchable.pdf (1.9 MB)
BrainScan.2StepWay.Searchable.pdf (2.0 MB)
Observation.JPG (14.4 KB)
OriginalBrainScan.jpg (215.8 KB)

asad.ali · September 25, 2022, 8:14pm

@Gpatil

We will be releasing version 22.9 of the API soon, before the end of this month, i.e. September 2022.

No, it is a limitation of the API itself: it can recognize a maximum of 10 pages of a PDF at a time. We have limited this number in order to improve performance. You can recognize the entire document in parts of 10 pages in a loop.
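The 10-page loop mentioned above could be driven by a small helper that computes the page ranges. The following is only a minimal sketch: `PageBatches.Split` is a hypothetical helper and not part of Aspose.OCR, the total page count is assumed to be known by the caller, and page indexes are assumed to start at 0 as in the snippets in this thread.

```csharp
using System;
using System.Collections.Generic;

public static class PageBatches
{
    // Hypothetical helper (not part of Aspose.OCR): splits totalPages into
    // consecutive (Start, Count) ranges of at most batchSize pages each.
    public static List<(int Start, int Count)> Split(int totalPages, int batchSize = 10)
    {
        if (totalPages < 0) throw new ArgumentOutOfRangeException(nameof(totalPages));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = new List<(int Start, int Count)>();
        for (int start = 0; start < totalPages; start += batchSize)
        {
            // The last batch may hold fewer than batchSize pages.
            int count = Math.Min(batchSize, totalPages - start);
            batches.Add((start, count));
        }
        return batches;
    }

    public static void Main()
    {
        // Assumption: the page count (23 here) is a made-up example value.
        foreach (var (start, count) in Split(23))
        {
            // In the snippets from this thread, these values would be assigned to
            // settings.StartPage and settings.PagesNumber before each RecognizePdf call.
            Console.WriteLine($"StartPage = {start}, PagesNumber = {count}");
        }
    }
}
```

For a 23-page document this yields the ranges (0, 10), (10, 10) and (20, 3). Since RecognizePdf returns a list of results in the snippets here, the per-batch results could presumably be appended to one list before saving, but that is untested.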

In the tooltip, “searchable PDF” means a PDF that has mixed content, e.g. text and images. The API is designed and recommended for creating searchable PDFs from source PDFs that contain only scanned images.

Could you please share the two code snippets you are actually using, so that we can compare the results of both approaches? It would help us investigate the scenario.

Gpatil · September 26, 2022, 12:06pm

@asad.ali
The first approach is what you shared earlier (just pasting the same code):

// C# Code
// Assumes the alias: using OCR = Aspose.OCR;
try
{
    var api = new OCR.AsposeOcr();

    var settings = new OCR.DocumentRecognitionSettings();
    settings.StartPage = 0;
    settings.PagesNumber = 6;
    //settings.LinesFiltration = true;
    settings.DetectAreas = true;
    settings.DetectAreasMode = OCR.DetectAreasMode.COMBINE;
    settings.ThreadsCount = 1;

    var res = api.RecognizePdf(dataDir + "A500008704_20220128_100605.tif_Searchable.pdf", settings);

    OCR.AsposeOcr.SaveMultipageDocument(dataDir + "File1_OCRd.pdf", OCR.SaveFormat.Pdf, res);
}
catch (Exception)
{
    throw; // rethrow without resetting the stack trace
}

The second approach is the same code I shared earlier in Code&Files.zip after the first post:

var api = new Aspose.OCR.AsposeOcr();
var settings = new Aspose.OCR.DocumentRecognitionSettings();
settings.StartPage = 0;
settings.PagesNumber = 6;
//settings.LinesFiltration = true;
settings.DetectAreas = true;
settings.DetectAreasMode = Aspose.OCR.DetectAreasMode.COMBINE;
settings.ThreadsCount = 1;
RecognitionSettings RS = new RecognitionSettings();
RS.DetectAreasMode = Aspose.OCR.DetectAreasMode.COMBINE;
RS.ThreadsCount = 1;

// Step 1: recognize the TIFF image and save the result as a PDF
List<RecognitionResult> lst = new List<RecognitionResult>();
lst.Add(api.RecognizeImage(dataDir + "A500008704_20220128_100605.tif", RS));
AsposeOcr.SaveMultipageDocument(dataDir + "A500008704_20220128_100605.tif_Searchable.pdf", SaveFormat.Pdf, lst);

// Step 2: recognize the PDF from step 1 and save the searchable PDF
List<RecognitionResult> res = api.RecognizePdf(dataDir + "A500008704_20220128_100605.tif_Searchable.pdf", settings);
Aspose.OCR.AsposeOcr.SaveMultipageDocument(dataDir + "File1_OCRd.pdf", Aspose.OCR.SaveFormat.Pdf, res);

Gpatil · September 26, 2022, 6:06pm

@asad.ali
Also, is there any reason why RecognizeTiff() does not have an overload that takes a memory stream? We actually need one. Can you provide a memory stream overload for RecognizeTiff() too? It would be of great help.

The RecognizeImage() function has an overload that takes a memory stream, which is exactly what we want, but RecognitionSettings does not have the StartPage and PagesNumber properties needed to loop over all pages of a TIFF.

Can we also have StartPage and PagesNumber properties on RecognitionSettings, as we have for DocumentRecognitionSettings?

asad.ali · September 26, 2022, 6:41pm

@Gpatil

The following tickets have been logged in our issue tracking system for your concerns:

OCRNET-583

OCRNET-584

OCRNET-585

We will surely look into the details of these tickets and let you know as soon as they are resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

Gpatil · September 26, 2022, 6:46pm

Thanks.

@asad.ali
Thank you so much for considering the need. I will be looking forward to it.

Gpatil · October 3, 2022, 12:13pm

@asad.ali

OCRNET-584 - Status: Resolved

Glad to know, Asad, that this issue has been resolved so quickly. When can we expect this change in the NuGet package?

asad.ali · October 3, 2022, 8:08pm

@Gpatil

The ticket has been resolved and it will be included in the 22.10 version of the API. As soon as the fixed-in version is available, we will let you know.

aspose.notifier · November 2, 2022, 11:17am

The issues you have found earlier (filed as OCRNET-584) have been fixed in this update. This message was posted using Bugs notification tool by anna.pylaieva