Exception occurred when converting non-searchable pdf to searchable with Tesseract using Aspose.PDF

erdeiga · October 9, 2020, 2:38pm

I’m using the Document.Convert method (Document.CallBackGetHocr) to convert a non-searchable PDF to a searchable one. On Windows everything works fine, but on Linux there is sometimes an "Object reference not set to an instance of an object" exception. Can you please check what is causing this?

System.NullReferenceException: Object reference not set to an instance of an object.
at #=zUr3fyKG1IneV0AIo$ZJrHNxAgCWrMOraZO9msTDB6W4L1s$Be5ATPPU=.#=zuSc5NR0=(#=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o #=z69tfiEUqfu$v)
at #=zn$29lVNwSHOIhgCpP$s8qRUAxdcy1kTcFYiLfDs0BwbHSfCTlZ5OEafWwZcsk9dKKKTrN3$T2iRX.#=zl9b8KPg=(String #=zdkzBYmY=, #=z0xbEtHVunPxvflL$O3Af58SuChIjPLiVAJGgbM= #=zSZt9qIM=, #=zXXJGE6zwmDAguaHaPQVRo797BG1jLd_f0_WeUJ8js5anfuxlYZQ2hDo= #=z4ZCvveueTtMo, Boolean #=zo498M7A=, Boolean #=zy_fFi4g=, #=zaKRYDaRKfNUbWk69OrBg$Fr2lIxelkNDOw==& #=zFJKWf4M=, #=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o& #=z69tfiEUqfu$v, String& #=zjzDo9inI_H9)
at #=zUr3fyKG1IneV0AIo$ZJrHNxAgCWrMOraZO9msTDB6W4L1s$Be5ATPPU=.#=zKLJ78x8=(#=zP3Oj0_YpGtBc0IY9ZZL8k_CcyQwLygSSZVRUtogou$aSmT1hcQZwEy85cycvObZ8Q8lQCCAKuCDPXycTRA==[] #=zdGwkVbUobqP8ey3S9uhNQRa7M6az, String #=zdkzBYmY=, #=z0xbEtHVunPxvflL$O3Af58SuChIjPLiVAJGgbM= #=zSZt9qIM=, #=zXXJGE6zwmDAguaHaPQVRo797BG1jLd_f0_WeUJ8js5anfuxlYZQ2hDo= #=z4ZCvveueTtMo, Boolean #=zo498M7A=, Boolean #=zy_fFi4g=, #=zaKRYDaRKfNUbWk69OrBg$Fr2lIxelkNDOw==& #=zR8jWpoo=, #=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o& #=z69tfiEUqfu$v, String& #=zjzDo9inI_H9)
at #=zT402Il0$NabGYcvoJ7Px6s9V0jMsnSWn_zK7y82LEID7s$lRgQ==.#=zV42_4BgisSQ7(#=zJLSE8nCoUUPRmjpknNaFG3zEIIrxISVshA== #=zVCBSm1o=, TextEditOptions #=z4RSMOLM=)
at Aspose.Pdf.Text.TextSegment.set_Text(String value)
at Aspose.Pdf.Text.TextSegment.#=zhbn1PHk=(#=zigVcVy0kF4TRalKjQfSZwcYEHcgRyhXISOYJy89zQBEgDdBz12CmBt8= #=z9M1yQrGcoAw4)
at Aspose.Pdf.Text.TextBuilder.#=zIPkcA5E=(TextFragment #=zxCOHDNoKqH5i, Int32 #=zSVREwdE=, Boolean #=z3QGA9rEx6vgG)
at Aspose.Pdf.Text.TextBuilder.#=zo7_2BsVc87cY(TextParagraph #=zlTAuhZgwDpby, Int32 #=zSVREwdE=)
at #=ztfmBK0IQTD8RH0KlEQr_DdejyCjPQVXjh$3yOafcgxPT9LWvqZUqels=.#=z2pyN7$M=(String #=z6oMWFFq5jp2a, Rectangle #=zAJHLYmDDP$pW, Image #=zk0fsclY=, Page #=zcwftyS8=, TextBuilder #=zUf6ivZc=, Single #=zSVREwdE=)
at #=ztfmBK0IQTD8RH0KlEQr_DdejyCjPQVXjh$3yOafcgxPT9LWvqZUqels=.#=zQIB0o7k=(CallBackGetHocr #=zeTAMaTT7hlvH, Document #=zGB76UFY=)
at Aspose.Pdf.Document.Convert(CallBackGetHocr callback)

With these sample pdfs the exception can be reproduced.
sample_files.zip (1019.5 KB)

C# .Net Core 3.1
Aspose.PDF 20.10
tesseract 4.1.1
leptonica-1.79.0

I created a simple .Net Core project so you can try to reproduce it.
ConvertToSearchable.zip (2.6 KB)

I have another question. My main goal is to search for and redact some text (words) in the PDF, and because of this issue I cannot use the Convert() method on Linux when the PDF contains images. I thought I would extract the images with ImagePlacementAbsorber, OCR them with Tesseract, convert the hOCR bbox coordinates to a Rectangle, and redact that area in the PDF. Can you help me convert these bounding box coordinates to a Rectangle, given that I also know the image position (Rectangle) on the page?

Thanks,
Gabor

asad.ali · October 11, 2020, 4:31pm

We have logged this issue as PDFNET-48890 in our issue tracking system. We will look into it further and keep you informed of the status of its rectification. Please be patient and give us some time.

Could you please share some sample extracted coordinate values along with the respective PDF? We will try to use those values to redact the text in the PDF and share our feedback with you accordingly.

erdeiga · October 12, 2020, 2:11pm

Hi @asad.ali,

I attached an example pdf with extracted images and the hocr data.

The image rectangles are:

On page 2: 56.70000076293945,422.8890075683594,538.5999946594238,785.239013671875
On page 3: 56.70000076293945,380.6889953613281,538.5999946594238,785.2389831542969

For example, how could I redact the 5555555555554444 on the second page if these are the bbox coordinates of the word on the image?

<span class='ocrx_word' id='word_1_71' title='bbox 202 354 388 370; x_wconf 91'>5555555555554444</span>

Extracted_images_hocr_data.zip (1.5 MB)

asad.ali · October 12, 2020, 10:41pm

@erdeiga

You can draw a rectangle in the PDF to redact a certain portion using the following code:

// Requires: using Aspose.Pdf; dataDir is the folder that holds the input PDF.
Document doc = new Document(dataDir + "sample_file.pdf");
Page page = doc.Pages[2];
// Canvas sized to the page.
var canvas = new Aspose.Pdf.Drawing.Graph((float)page.PageInfo.Width, (float)page.PageInfo.Height);
page.Paragraphs.Add(canvas);
// Drawing.Rectangle takes left, bottom, width, height.
Aspose.Pdf.Drawing.Rectangle rect = new Aspose.Pdf.Drawing.Rectangle(202, 354, 72, 30);
rect.GraphInfo.Color = Aspose.Pdf.Color.FromRgb(0, 0, 0);     // black outline
rect.GraphInfo.FillColor = Aspose.Pdf.Color.FromRgb(0, 0, 0); // black fill so the area is covered
canvas.Shapes.Add(rect);
doc.Save(dataDir + "CircleGraph.pdf");

However, the coordinates extracted by Tesseract are not in the form that a rectangle requires. As the code snippet above shows, Aspose.Pdf.Drawing.Rectangle() takes left, bottom, width and height as arguments. We tried to add the rectangle using the values which you shared, but it did not work. We are afraid that it may not be possible to use these values to draw a rectangle at the desired location.
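
For reference, the mapping asked about above can be sketched as plain arithmetic. An hOCR bbox is "x0 y0 x1 y1" in image pixels measured from the top-left corner, while a PDF rectangle is given in points from the bottom-left corner, so the values need scaling and a vertical flip. This is an untested sketch under several assumptions: the image is placed unrotated, Tesseract was run on the image at its native pixel size, and the image's pixel width and height are known (they are not stated in this thread). The class and method names (HocrToPdf, ParseBbox, ToPageRect) are hypothetical and are not part of Aspose.PDF.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Aspose.PDF.
public static class HocrToPdf
{
    // Extracts "bbox x0 y0 x1 y1" from an hOCR title attribute,
    // e.g. "bbox 202 354 388 370; x_wconf 91".
    public static (double X0, double Y0, double X1, double Y1) ParseBbox(string title)
    {
        foreach (string part in title.Split(';'))
        {
            string[] t = part.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (t.Length == 5 && t[0] == "bbox")
            {
                return (double.Parse(t[1], CultureInfo.InvariantCulture),
                        double.Parse(t[2], CultureInfo.InvariantCulture),
                        double.Parse(t[3], CultureInfo.InvariantCulture),
                        double.Parse(t[4], CultureInfo.InvariantCulture));
            }
        }
        throw new FormatException("No bbox found in: " + title);
    }

    // Maps a pixel bbox (origin top-left) to page coordinates (origin bottom-left).
    // imgLlx..imgUry is the image rectangle on the page, in points.
    // imagePixelWidth/Height is the size of the image that was OCRed (assumed known).
    public static (double Llx, double Lly, double Urx, double Ury) ToPageRect(
        (double X0, double Y0, double X1, double Y1) bbox,
        double imgLlx, double imgLly, double imgUrx, double imgUry,
        int imagePixelWidth, int imagePixelHeight)
    {
        double scaleX = (imgUrx - imgLlx) / imagePixelWidth;   // points per pixel, horizontal
        double scaleY = (imgUry - imgLly) / imagePixelHeight;  // points per pixel, vertical

        return (imgLlx + bbox.X0 * scaleX,
                imgUry - bbox.Y1 * scaleY,   // bottom of the word: larger pixel y
                imgLlx + bbox.X1 * scaleX,
                imgUry - bbox.Y0 * scaleY);  // top of the word: smaller pixel y
    }
}
```

With the page 2 image rectangle from this thread, a call would look like HocrToPdf.ToPageRect(HocrToPdf.ParseBbox("bbox 202 354 388 370; x_wconf 91"), 56.70000076293945, 422.8890075683594, 538.5999946594238, 785.239013671875, pixelWidth, pixelHeight), where pixelWidth and pixelHeight are placeholders for the extracted image's real dimensions. The four resulting values are in the llx, lly, urx, ury order that Aspose.Pdf.Rectangle expects, so they could presumably be passed to a RedactionAnnotation on that page. That last step has not been verified here.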

asad.ali · April 11, 2022, 7:54pm

2 posts were split to a new topic: Convert Scanned PDF to Searchable PDF using C#