Converting a non-searchable PDF to a searchable PDF document does not work

Hi,

I wanted to test the hOCR merge capabilities of Aspose.Pdf and implemented a small program using Tesseract, based on this article from the documentation. However, this does not seem to work.

The code I use is:

private String doTesseractOCR(java.awt.image.BufferedImage img) throws Exception {
    Tesseract instance = new Tesseract(); // JNA Interface Mapping
    instance.setHocr(true); // request hOCR output instead of plain text

    try {
        String result = instance.doOCR(img);
        System.out.println(result);
        return result;
    } catch (TesseractException e) {
        throw new Exception("Failed to process", e);
    }
}

private void mergeResult(Path outfile) throws Exception {
    License lic = new License();
    lic.setLicense("Aspose.Total.Java.lic");

    Document doc = new Document(inFile.toString());

    // Callback that returns the hOCR for the image it is given
    Document.CallBackGetHocr cbgh = new Document.CallBackGetHocr() {
        public String invoke(java.awt.image.BufferedImage img) {
            try {
                return doTesseractOCR(img);
            } catch (Exception e) {
                System.out.println("failed to recognise img");
                e.printStackTrace();
                return "";
            }
        }
    };
    // End callBack

    System.out.println("converting");
    doc.convert(cbgh);
    System.out.println("saving");
    doc.save(outfile.toString());
    doc.dispose();
}

The doTesseractOCR function returns hOCR correctly and the callback is called by the convert function.
However, when the file is saved, there is no text available in the document.

I have already tried multiple versions (10.0, 10.9 and 11.2) and multiple PDF documents, all with the same result.

Thanks for your assistance,


Hi there,


Thanks for your inquiry. We would appreciate it if you could share your sample source PDF document here; we will test the scenario and provide you with information accordingly.

We are sorry for the inconvenience caused.

Best Regards,

Hello,

Attached you will find an input file and the result file.

Kind regards,
Michiel

Hi Michiel,


Thanks for sharing the sample document. I am looking into it and will update you soon.

Best Regards,

Hi Tilal,

Do you have any news on this request?

Kind Regards,
Michiel

Hi Michiel,


We are sorry for the inconvenience. After an initial investigation, we have logged ticket PDFNEWJAVA-35599 in our issue tracking system for further investigation. We will keep you updated on the progress of the resolution in this forum thread.

Best Regards,

Hi,

Do you have an update on this?
Can we get a commitment on when this will be fixed if we forward this to priority support? If so, please do.

Kind Regards,
Michiel

Hi Michiel,


Thanks for your patience.

I am afraid the earlier reported issue is not yet resolved. Please note that issues are resolved on a first come, first served basis, as we believe it is the fairest policy for all customers. However, problems reported under the Enterprise or Priority support model have higher precedence in terms of resolution than issues under the normal/free support model.

Nonetheless, ES/PS support does not guarantee immediate resolution of issues (because a fix might depend on other issues or features that need to be implemented), but under this model the development team starts investigating the problem with high priority. Meanwhile, I have asked the product team to evaluate the earlier reported issue and share whether raising the priority would be beneficial.

We are sorry for this delay and inconvenience.

Hi Michiel,


Our product team has started to review the issue and we need some additional information for the investigation. As you stated above, "The doTesseractOCR function returns hOCR correctly and the callback is called by the convert function." Please share the hOCR string that you are getting from the doTesseractOCR function. We will look into it and keep you updated on the progress of the resolution.

Best Regards,

Hi,

Attached is a text file with the Tesseract hOCR output.
It includes multiple parts; each <!DOCTYPE starts a new part.
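
For anyone who needs to work with such a combined file one page at a time, the parts can be separated at each marker. The following is only a minimal sketch in plain Java, assuming every part begins with the literal text <!DOCTYPE. The names HocrSplitter and splitAtMarker are hypothetical and are not part of Tesseract or Aspose.Pdf.

```java
import java.util.ArrayList;
import java.util.List;

class HocrSplitter {
    // Splits concatenated hOCR output into one string per part.
    // Assumption: each part starts with the given marker (e.g. "<!DOCTYPE").
    // Any text before the first marker is discarded.
    static List<String> splitAtMarker(String combined, String marker) {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = combined.indexOf(marker);
        while (start >= 0) {
            // Look for the beginning of the next part, if there is one
            int next = combined.indexOf(marker, start + marker.length());
            String part = (next >= 0)
                    ? combined.substring(start, next)
                    : combined.substring(start);
            parts.add(part.trim());
            start = next;
        }
        return parts;
    }
}
```

Calling HocrSplitter.splitAtMarker(text, "<!DOCTYPE") on the combined output should then give one list entry per part, which could each be written to a separate file.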

Kind Regards,
Michiel


Hi Michiel,


Thanks for sharing the OCR output. We have passed the information on to our product team and will notify you as soon as we make further progress towards resolving the issue.

Thanks for your patience and cooperation.

Best Regards,

Hi,


Any news on this?
Is there any way we can speed this up? This is a documented functionality that is not working…

Kind Regards,
Michiel

Hi Michiel,


Thanks for your inquiry. We have good news for you: the issue you reported above has been resolved and the fix will be included in the upcoming release, Aspose.Pdf for Java 11.5.0. It should be published at the start of May 2016. We will notify you as soon as it is published and available for download.

Thanks for your patience and cooperation.

Best Regards,

The issues you have found earlier (filed as PDFNEWJAVA-35599) have been fixed in Aspose.Pdf for Java 11.5.0.



Hi,


I just tested this with the new Aspose.Pdf 11.5.0 and it is still not working…
Was this "fix" tested and, if so, what is different in my code that it isn't working?

Kind regards,
Michiel

Hi Michiel,


Please note that the attached txt file has no namespace in the html tag.
It is necessary to declare the following namespace:

...

Also, the problem with bbox values has been fixed. Taking the above into account, I divided the attached tesseract+output.txt file into 4 separate files and created a searchable PDF successfully.

final int[] page = {1};

Document doc = new Document(myDir + "in (1).pdf");

doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            int len;
            char[] chr = new char[4096];
            final StringBuffer buffer = new StringBuffer();
            final FileReader reader = new FileReader(myDir + "tesseract+output+page" + page[0]++ + ".txt");
            try {
                while ((len = reader.read(chr)) > 0) {
                    buffer.append(chr, 0, len);
                }
            } finally {
                reader.close();
            }
            return buffer.toString();
        } catch (FileNotFoundException e) {
            // e.printStackTrace();
        } catch (java.lang.Exception exc) {
            // exc.printStackTrace();
        }
        return null;
    }
});

doc.save(myDir + "out_1150.pdf");
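
As an illustration of the namespace requirement mentioned above, the hOCR string could be patched before it is returned from the callback. This is only a sketch under a loud assumption: the exact namespace declaration is not shown in this thread, so the standard XHTML namespace (http://www.w3.org/1999/xhtml) is assumed here. The names HocrNamespace and ensureNamespace are hypothetical and are not part of Aspose.Pdf or Tesseract.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrNamespace {
    // ASSUMPTION: the XHTML namespace is the one that has to be declared.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Matches the opening <html> tag, with or without attributes
    private static final Pattern HTML_TAG = Pattern.compile("<html(\\s[^>]*)?>");

    // Returns the hOCR with a default namespace on the html tag,
    // or the input unchanged if one is already declared or no html tag is found.
    static String ensureNamespace(String hocr) {
        Matcher m = HTML_TAG.matcher(hocr);
        if (!m.find()) {
            return hocr; // no html tag, leave untouched
        }
        String attrs = (m.group(1) == null) ? "" : m.group(1);
        if (attrs.contains("xmlns=")) {
            return hocr; // a default namespace is already declared
        }
        String patched = "<html xmlns=\"" + XHTML_NS + "\"" + attrs + ">";
        return hocr.substring(0, m.start()) + patched + hocr.substring(m.end());
    }
}
```

In a callback, this might be used as return HocrNamespace.ensureNamespace(hocr); whether that alone is enough to get searchable text is not confirmed by this thread.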


Best Regards,