I’m using the Document.Convert method (with a Document.CallBackGetHocr callback) to convert non-searchable PDFs to searchable ones. On Windows everything works fine, but on Linux it sometimes throws an "Object reference not set to an instance of an object" exception. Can you please check what is causing this?
System.NullReferenceException: Object reference not set to an instance of an object.
at #=zUr3fyKG1IneV0AIo$ZJrHNxAgCWrMOraZO9msTDB6W4L1s$Be5ATPPU=.#=zuSc5NR0=(#=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o #=z69tfiEUqfu$v)
at #=zn$29lVNwSHOIhgCpP$s8qRUAxdcy1kTcFYiLfDs0BwbHSfCTlZ5OEafWwZcsk9dKKKTrN3$T2iRX.#=zl9b8KPg=(String #=zdkzBYmY=, #=z0xbEtHVunPxvflL$O3Af58SuChIjPLiVAJGgbM= #=zSZt9qIM=, #=zXXJGE6zwmDAguaHaPQVRo797BG1jLd_f0_WeUJ8js5anfuxlYZQ2hDo= #=z4ZCvveueTtMo, Boolean #=zo498M7A=, Boolean #=zy_fFi4g=, #=zaKRYDaRKfNUbWk69OrBg$Fr2lIxelkNDOw==& #=zFJKWf4M=, #=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o& #=z69tfiEUqfu$v, String& #=zjzDo9inI_H9)
at #=zUr3fyKG1IneV0AIo$ZJrHNxAgCWrMOraZO9msTDB6W4L1s$Be5ATPPU=.#=zKLJ78x8=(#=zP3Oj0_YpGtBc0IY9ZZL8k_CcyQwLygSSZVRUtogou$aSmT1hcQZwEy85cycvObZ8Q8lQCCAKuCDPXycTRA==[] #=zdGwkVbUobqP8ey3S9uhNQRa7M6az, String #=zdkzBYmY=, #=z0xbEtHVunPxvflL$O3Af58SuChIjPLiVAJGgbM= #=zSZt9qIM=, #=zXXJGE6zwmDAguaHaPQVRo797BG1jLd_f0_WeUJ8js5anfuxlYZQ2hDo= #=z4ZCvveueTtMo, Boolean #=zo498M7A=, Boolean #=zy_fFi4g=, #=zaKRYDaRKfNUbWk69OrBg$Fr2lIxelkNDOw==& #=zR8jWpoo=, #=z3tDFAvFbCZ4PoKhyWn1BZYH2ZAOoB3i6FLnXqfmLsh9o& #=z69tfiEUqfu$v, String& #=zjzDo9inI_H9)
at #=zT402Il0$NabGYcvoJ7Px6s9V0jMsnSWn_zK7y82LEID7s$lRgQ==.#=zV42_4BgisSQ7(#=zJLSE8nCoUUPRmjpknNaFG3zEIIrxISVshA== #=zVCBSm1o=, TextEditOptions #=z4RSMOLM=)
at Aspose.Pdf.Text.TextSegment.set_Text(String value)
at Aspose.Pdf.Text.TextSegment.#=zhbn1PHk=(#=zigVcVy0kF4TRalKjQfSZwcYEHcgRyhXISOYJy89zQBEgDdBz12CmBt8= #=z9M1yQrGcoAw4)
at Aspose.Pdf.Text.TextBuilder.#=zIPkcA5E=(TextFragment #=zxCOHDNoKqH5i, Int32 #=zSVREwdE=, Boolean #=z3QGA9rEx6vgG)
at Aspose.Pdf.Text.TextBuilder.#=zo7_2BsVc87cY(TextParagraph #=zlTAuhZgwDpby, Int32 #=zSVREwdE=)
at #=ztfmBK0IQTD8RH0KlEQr_DdejyCjPQVXjh$3yOafcgxPT9LWvqZUqels=.#=z2pyN7$M=(String #=z6oMWFFq5jp2a, Rectangle #=zAJHLYmDDP$pW, Image #=zk0fsclY=, Page #=zcwftyS8=, TextBuilder #=zUf6ivZc=, Single #=zSVREwdE=)
at #=ztfmBK0IQTD8RH0KlEQr_DdejyCjPQVXjh$3yOafcgxPT9LWvqZUqels=.#=zQIB0o7k=(CallBackGetHocr #=zeTAMaTT7hlvH, Document #=zGB76UFY=)
at Aspose.Pdf.Document.Convert(CallBackGetHocr callback)
The exception can be reproduced with these sample PDFs.
sample_files.zip (1019.5 KB)
C# .Net Core 3.1
Aspose.PDF 20.10
tesseract 4.1.1
leptonica-1.79.0
I created a simple .NET Core project so you can try to reproduce it.
ConvertToSearchable.zip (2.6 KB)
I have another question. My main goal is to search for and redact certain words in the PDF, and because of this issue I cannot use the Convert() method on Linux when the PDF contains images. As a workaround, I am thinking of extracting the images with ImagePlacementAbsorber, running OCR on them with tesseract, converting the hOCR bbox coordinates to a Rectangle, and redacting that area in the PDF. Could you help me convert these bounding box coordinates to a Rectangle, given that I also know the image's position (Rectangle) on the page?
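To make the question concrete, here is a minimal sketch of the mapping I have in mind. It is plain arithmetic and does not call Aspose.PDF; the names HocrMapper, PdfBox and ToPageBox are hypothetical. It assumes the hOCR bbox is "x0 y0 x1 y1" in image pixels with the origin at the top-left corner, that the page rectangle uses PDF coordinates with the origin at the bottom-left, and that the image is placed without rotation or cropping.

```csharp
using System;

// Hypothetical result type (not an Aspose.PDF class): a box in PDF page coordinates.
public readonly struct PdfBox
{
    public double Llx { get; }
    public double Lly { get; }
    public double Urx { get; }
    public double Ury { get; }

    public PdfBox(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }
}

// Hypothetical helper: maps an hOCR bbox (pixels, y grows downward)
// into the page rectangle occupied by the image (points, y grows upward).
public static class HocrMapper
{
    public static PdfBox ToPageBox(
        int x0, int y0, int x1, int y1,          // hOCR bbox in image pixels
        int imageWidthPx, int imageHeightPx,     // pixel size of the OCR'd image
        double imgLlx, double imgLly,            // image placement on the page
        double imgUrx, double imgUry)
    {
        if (imageWidthPx <= 0 || imageHeightPx <= 0)
            throw new ArgumentOutOfRangeException(nameof(imageWidthPx), "Image size must be positive.");

        // Page units per pixel along each axis.
        double sx = (imgUrx - imgLlx) / imageWidthPx;
        double sy = (imgUry - imgLly) / imageHeightPx;

        // x maps directly; y is flipped because hOCR measures from the top edge.
        double llx = imgLlx + x0 * sx;
        double urx = imgLlx + x1 * sx;
        double ury = imgUry - y0 * sy;
        double lly = imgUry - y1 * sy;

        return new PdfBox(llx, lly, urx, ury);
    }
}
```

For example, for a 1000 x 500 px image placed at (100, 200)-(300, 300) on the page, the bbox "100 50 200 100" would map to (120, 280)-(140, 290). The four values could then presumably be passed to the Aspose.Pdf.Rectangle(llx, lly, urx, ury) constructor, but I am not sure whether this holds when the image is rotated or when the pixel size passed to tesseract differs from the embedded image size.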
Thanks,
Gabor