Hi Carlos,
Thanks for your patience and sorry for the delayed response.
I tested the scenario using the following code snippet, with Aspose.Pdf for Java 9.5.2 and tess4j_1_3_0 in an Eclipse Juno project running on Windows 7 (x64) with JDK 1.7. Unfortunately, it failed with the error message shown after the code.
[Java]
import com.aspose.pdf.Document;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

Document doc = new Document("c:/pdftest/1704-01-2012-017-C003-025.pdf");
// The callback is invoked with a page image and must return its hOCR text
doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
            instance.setHocr(true);
            instance.setLanguage("spa");
            String result = instance.doOCR(bi);
            return result;
        } catch (TesseractException ex) {
            ex.printStackTrace();
        }
        return null;
    }
});
doc.save("c:/pdftest/1704-01-2012-017-C003-025_ASPOSE_9_5_2.pdf");
StackTrace
Exception in thread "main" java.lang.UnsatisfiedLinkError: The specified module could not be found.
    at com.sun.jna.Native.open(Native Method)
    at com.sun.jna.Native.open(Native.java:1759)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:260)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
    at com.sun.jna.Library$Handler.<init>(Library.java:147)
    at com.sun.jna.Native.loadLibrary(Native.java:412)
    at com.sun.jna.Native.loadLibrary(Native.java:391)
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:45)
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:283)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:219)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:200)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:184)
    at test$1.invoke(test.java:222)
    at com.aspose.pdf.internal.p476.z19.m1(Unknown Source)
    at com.aspose.pdf.ADocument.convert(Unknown Source)
    at com.aspose.pdf.Document.convert(Unknown Source)
    at test.main(test.java:215)
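An UnsatisfiedLinkError raised from com.sun.jna.Native.open usually means that JNA could not load the native Tesseract library or something that library depends on. As a rough diagnostic, the sketch below lists which directories on a search path actually contain a given library file. The class name NativeLibraryLocator and the method findLibrary are hypothetical helpers written for illustration; they are not part of JNA, tess4j, or Aspose.Pdf.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JNA, tess4j, or Aspose.Pdf.
class NativeLibraryLocator {

    /**
     * Returns every directory in searchPath (entries separated by
     * File.pathSeparator) that contains a regular file named fileName.
     */
    static List<File> findLibrary(String fileName, String searchPath) {
        List<File> hits = new ArrayList<File>();
        if (searchPath == null || searchPath.isEmpty()) {
            return hits; // nothing to search
        }
        // Quote the separator because String.split expects a regular expression
        for (String dir : searchPath.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as ";;" on Windows
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate.getParentFile());
            }
        }
        return hits;
    }
}
```

A possible call is NativeLibraryLocator.findLibrary("libtesseract302.dll", System.getenv("PATH")), or the same with System.getProperty("jna.library.path"). The DLL file name here is an assumption and depends on the tess4j build in use. An empty result suggests the library is not on that path. A non-empty result does not rule out the error, because on Windows the same message can also appear when the DLL is found but one of its own dependencies is missing.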
I am afraid I did not understand your point: “You can avoid using Tesseract by making invoke return the contents of the file ‘hocr.txt’.” Could you please share some further details or a code snippet that would help us replicate the issue in our environment?
Hi, thanks for your reply. The error occurs because Tesseract cannot find its dependencies (DLLs). To avoid Tesseract, modify the invoke method so that it returns the contents of hocr.txt, which is attached to my first post. The method FileUtils.readFileToByteArray comes from the Apache commons-io library and is used to load the file into a byte array.
Thanks for your attention
public String invoke(BufferedImage bi) {
    try {
        // pass the path to hocr.txt here
        String hocr = new String(FileUtils.readFileToByteArray(new File()));
        return hocr;
    } catch (Exception exc) {
        exc.printStackTrace();
    }
    return null;
}
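For anyone who prefers not to add commons-io, the same idea can be sketched with the JDK alone (Java 7 or later). The class HocrFileReader and its method readHocr are hypothetical names used for illustration. The sketch also assumes the hOCR file is encoded as UTF-8; the snippet above decodes with the platform default charset, which may garble accented characters in Spanish text on Windows if the file is in fact UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; assumes the hOCR file is UTF-8 encoded.
class HocrFileReader {

    /** Reads the whole file at the given path and decodes it as UTF-8. */
    static String readHocr(String path) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        // Decode with an explicit charset instead of the platform default
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Inside invoke, the body of the try block would then reduce to a single return of HocrFileReader.readHocr(...) with the path to hocr.txt.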
Hi Carlos,
Thanks for sharing the details.
I have tested the scenario using the following code snippet and observed that a searchable PDF file is not generated. I have logged the problem in our issue tracking system as PDFNEWJAVA-34536. We will investigate it in detail and keep you updated on the status of a fix.
We apologize for the inconvenience.
[Java]
Document doc = new Document("c:/pdftest/1704-01-2012-017-C003-025.pdf");
doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            int len;
            char[] chr = new char[4096];
            final StringBuffer buffer = new StringBuffer();
            final FileReader reader = new FileReader("c:/pdftest/hocr.txt");
            try {
                while ((len = reader.read(chr)) > 0) {
                    buffer.append(chr, 0, len);
                }
            } finally {
                reader.close();
            }
            return buffer.toString();
        } catch (Exception exc) {
            exc.printStackTrace();
        }
        return null;
    }
});
doc.save("c:/pdftest/1704-01-2012-017-C003-025_ASPOSE_9_5_2.pdf");
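When reproducing this, it may help to first confirm that the string returned from invoke really contains word-level hOCR markup, since an empty or truncated file would also lead to a PDF without searchable text. The sketch below is a rough check only, not an hOCR parser: it counts elements carrying the ocrx_word class that hOCR uses for recognized words. The class HocrSanityCheck and the method countWords are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a rough text check, not an hOCR parser.
class HocrSanityCheck {

    // Matches class='ocrx_word' or class="ocrx_word" as a sole class value
    private static final Pattern WORD =
            Pattern.compile("class\\s*=\\s*['\"]ocrx_word['\"]");

    /** Returns how many word-level elements appear in the hOCR text. */
    static int countWords(String hocr) {
        if (hocr == null) {
            return 0; // invoke returned null, for example after an exception
        }
        Matcher m = WORD.matcher(hocr);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

A count of zero for the text read from hocr.txt would point to a problem with the input file rather than with the conversion itself.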
Hi, I downloaded Aspose.Pdf for Java 9.7.1 and 10.0.0 to test the changes for this issue, but with no success.