Converting a non-searchable PDF to a searchable PDF document does not work

geert_vanpeteghem_docshifter_com · February 16, 2016, 4:00am

Hi,

I wanted to test the hOCR merge capabilities of Aspose.Pdf and implemented a small program using Tesseract, based on this article from the documentation. However, it does not seem to work.

The code I use is:

   private String doTesseractOCR(java.awt.image.BufferedImage img) throws Exception {

      Tesseract instance = new Tesseract();  // JNA Interface Mapping
      instance.setHocr(true);

      try {
         String result = instance.doOCR(img);
         System.out.println(result);
         return result;
      } catch (TesseractException e) {
         throw new Exception("Failed to process", e);
      }
   }

   private void mergeResult(Path outfile) throws Exception {
      License lic = new License();
      lic.setLicense("Aspose.Total.Java.lic");

      // inFile is not declared in this snippet; presumably a field holding the source PDF path.
      Document doc = new Document(inFile.toString());

      // Callback that convert() calls to obtain the hOCR for an image.
      Document.CallBackGetHocr cbgh = new Document.CallBackGetHocr()
      {
         public String invoke(java.awt.image.BufferedImage img) {
            try {
               return doTesseractOCR(img);
            } catch (Exception e) {
               System.out.println("failed to recognise img");
               e.printStackTrace();
               return "";
            }
         }
      };
      // End callBack

      System.out.println("converting");
      doc.convert(cbgh);
      System.out.println("saving");
      doc.save(outfile.toString());
      doc.dispose();
   }

The doTesseractOCR function returns hOCR correctly, and the callback is called by the convert function.
However, when the file is saved, there is no text available in the document.

I have already tried multiple versions (10.0, 10.9 and 11.2) and multiple PDF documents, all with the same result.

Thanks for your assistance,

tilal.ahmad · February 17, 2016, 2:56am

Hi there,

Thanks for your inquiry. We would appreciate it if you could share your sample source PDF document here. We will test the scenario and provide you with information accordingly.

We are sorry for the inconvenience caused.

Best Regards,

geert_vanpeteghem_docshifter_com · February 17, 2016, 9:12am

Hello,

Attached you will find an input file and the result file.

Kind regards,
Michiel

tilal.ahmad · February 18, 2016, 4:21am

Hi Michiel,

Thanks for sharing the sample document. I am looking into it and will update you soon.

Best Regards,

geert_vanpeteghem_docshifter_com · February 29, 2016, 4:52am

Hi Tilal,

Do you have any news on this request?

Kind Regards,
Michiel

tilal.ahmad · March 1, 2016, 12:04am

Hi Michiel,

We are sorry for the inconvenience. After an initial investigation, we have logged ticket PDFNEWJAVA-35599 in our issue tracking system for further investigation. We will keep you updated about the issue resolution progress in this forum thread.

Best Regards,

geert_vanpeteghem_docshifter_com · April 8, 2016, 5:33am

Hi,

Do you have an update on this?
Can we get a commitment on when this will be fixed if we forward this to priority support? If so please do.

Kind Regards,
Michiel

codewarior · April 10, 2016, 12:05pm

Hi Michiel,

Thanks for your patience.

I am afraid the earlier reported issue is not yet resolved. Please note that issues are resolved on a first-come, first-served basis, as we believe it is the fairest policy for all customers. However, problems reported under the Enterprise or Priority support model have higher precedence in terms of resolution than issues under the normal/free support model.

Nonetheless, ES/PS support does not guarantee immediate resolution of issues (because a fix might depend on other issues or features that need to be implemented), but under this model the development team starts investigating the problem with high priority. Meanwhile, I have asked the product team to evaluate the earlier reported issue and share whether raising the priority would be beneficial.

We are sorry for this delay and inconvenience.

tilal.ahmad · April 11, 2016, 10:17am

Hi Michiel,

Our product team has started to review the issue, and we need some additional information for the investigation. You stated above that the doTesseractOCR function returns hOCR correctly and that the callback is called by the convert function. Please share the hOCR string that you are getting from the doTesseractOCR function. We will look into it and keep you updated about the issue resolution progress.

Best Regards,

geert_vanpeteghem_docshifter_com · April 11, 2016, 11:10am

Hi,

Attached is a text file with the Tesseract hOCR output.
It includes multiple parts; each <!DOCTYPE starts a new part.

Kind Regards,
Michiel
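For readers following along, a file that contains several hOCR documents back to back can be separated again at each `<!DOCTYPE` marker, as described in the post above. The following is only a minimal sketch under stated assumptions: the class name `HocrSplitter` and the method `splitByDoctype` are hypothetical helpers, not part of Tesseract or Aspose.Pdf, and the marker is assumed to appear in upper case at the start of every part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library named in this thread.
final class HocrSplitter {

    // Assumption: every hOCR part begins with this exact, upper-case marker.
    private static final String MARKER = "<!DOCTYPE";

    /**
     * Splits concatenated hOCR output into one string per MARKER occurrence.
     * Any text before the first marker is discarded.
     */
    static List<String> splitByDoctype(String combined) {
        List<String> parts = new ArrayList<>();
        int start = combined.indexOf(MARKER);
        while (start >= 0) {
            // Look for the beginning of the next part, if there is one.
            int next = combined.indexOf(MARKER, start + MARKER.length());
            String part = (next >= 0)
                    ? combined.substring(start, next)
                    : combined.substring(start);
            parts.add(part.trim());
            start = next;
        }
        return parts;
    }
}
```

Each returned element could then be written to its own file or handed back from the callback one page at a time. If the real output carries an XML declaration in front of each `<!DOCTYPE`, the marker would need to be adjusted accordingly.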

tilal.ahmad · April 11, 2016, 11:40pm

Hi Michiel,

Thanks for sharing the OCR output. We have passed the information on to our product team and will notify you as soon as we make further progress towards resolving the issue.

Thanks for your patience and cooperation.

Best Regards,

geert_vanpeteghem_docshifter_com · April 21, 2016, 4:26am

Hi,

Any news on this?

Is there any way we can speed this up? This is a documented functionality that is not working…

Kind Regards,

Michiel

tilal.ahmad · April 21, 2016, 11:31pm

Hi Michiel,

Thanks for your inquiry. We have good news for you: the issue you reported above has been resolved, and its fix will be included in the upcoming release, Aspose.Pdf for Java 11.5.0. Hopefully it will be published at the start of May 2016. As soon as it is published and available for download, we will notify you as well.

Thanks for your patience and cooperation.

Best Regards,

aspose.notifier · May 10, 2016, 2:14pm

The issues you have found earlier (filed as PDFNEWJAVA-35599) have been fixed in Aspose.Pdf for Java 11.5.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

geert_vanpeteghem_docshifter_com · May 11, 2016, 3:11am

Hi,

I just tested this with the new Aspose PDF 11.5.0 and it is still not working…

Was this “fix” tested, and if so, what is different in my code that it isn't working?

Kind regards,

Michiel

tilal.ahmad · May 11, 2016, 10:51am

Hi Michiel,

Please note that the attached txt file has no namespace in the html tag.

It is necessary to declare the following namespace:


...

Also, the problem with bbox values has been fixed. Taking the above into account, I divided the attached tesseract+output.txt file into 4 separate files and created a searchable PDF successfully.
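The namespace requirement mentioned above could be handled just before the hOCR string is returned from the callback. The exact declaration from the original post did not survive in this thread, so the sketch below assumes the standard XHTML namespace `http://www.w3.org/1999/xhtml`; that value, the class name `HocrNamespace` and the method `ensureNamespace` are illustrative assumptions, not something confirmed by Aspose.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Aspose.Pdf or Tesseract.
final class HocrNamespace {

    // ASSUMPTION: the thread does not show the required namespace;
    // the standard XHTML namespace is used here as a plausible guess.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Matches the opening <html ...> tag, with or without attributes.
    private static final Pattern HTML_OPEN =
            Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Adds an xmlns attribute to the first <html> tag if it has none. */
    static String ensureNamespace(String hocr) {
        Matcher m = HTML_OPEN.matcher(hocr);
        if (!m.find()) {
            return hocr; // no <html> tag found, leave the input untouched
        }
        String attrs = (m.group(1) == null) ? "" : m.group(1);
        if (attrs.contains("xmlns=")) {
            return hocr; // a default namespace is already declared
        }
        String replacement = "<html xmlns=\"" + XHTML_NS + "\"" + attrs + ">";
        return hocr.substring(0, m.start()) + replacement + hocr.substring(m.end());
    }
}
```

In the callback from the first post, this would amount to returning `HocrNamespace.ensureNamespace(doTesseractOCR(img))` instead of the raw OCR result. Whether that alone is enough depends on the Aspose.Pdf version, since the reply also mentions a separate fix for bbox values.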

// One-element array so the anonymous class can advance the page counter.
final int[] page = {1};

Document doc = new Document(myDir + "in (1).pdf");

doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            int len;
            char[] chr = new char[4096];
            final StringBuffer buffer = new StringBuffer();
            // Open the pre-generated hOCR file for the current page, then advance the counter.
            final FileReader reader = new FileReader(myDir + "tesseract+output+page" + page[0]++ + ".txt");
            try {
                while ((len = reader.read(chr)) > 0) {
                    buffer.append(chr, 0, len);
                }
            } finally {
                reader.close();
            }
            return buffer.toString();
        } catch (FileNotFoundException e) {
            // e.printStackTrace();
        } catch (java.lang.Exception exc) {
            // exc.printStackTrace();
        }
        // No hOCR could be read for this image.
        return null;
    }
});

doc.save(myDir + "out_1150.pdf");

Best Regards,