Problem with HOCR

carlosl · September 18, 2014, 11:35am

Hi

I’m having an issue trying to embed HOCR data into a PDF.

My code looks like this:

Document doc = new Document("c:/bad/1704-01-2012-017-C003-029.pdf");

doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(BufferedImage bi) {
        try {
            Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
            instance.setHocr(true);
            instance.setLanguage("spa");
            String result = instance.doOCR(bi);
            return result;
        } catch (TesseractException ex) {
            ex.printStackTrace();
        }
        return null;
    }
});

doc.save("c:/bad/1704-01-2012-017-C003-025_ASPOSE.pdf");

Running this code (with the Tess4J dependencies) does not produce a searchable PDF. Document.CallBackGetHocr receives the image and Tesseract generates the HOCR, but when I save the document, the output is not searchable.

I’m attaching the input, the output and the HOCR generated by Tesseract.

You can avoid using Tesseract by making invoke return the contents of the file “hocr.txt”.

I hope you can help me with this problem.

Thanks for your attention

codewarior · September 21, 2014, 2:05pm

Hi Carlos,

Thanks for contacting support.

We are working on this query and will get back to you soon.

codewarior · October 30, 2014, 2:52am

Hi Carlos,

Thanks for your patience and sorry for the delayed response.

I have tested the scenario using the following code snippet, with Aspose.Pdf for Java 9.5.2 and tess4j_1_3_0 in an Eclipse Juno project running on Windows 7 (x64) with JDK 1.7. Unfortunately, I encountered the following error message.

Java

Document doc = new Document("c:/pdftest/1704-01-2012-017-C003-025.pdf");

doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
            instance.setHocr(true);
            instance.setLanguage("spa");
            String result = instance.doOCR(bi);
            return result;
        } catch (TesseractException ex) {
            ex.printStackTrace();
        }
        return null;
    }
});

doc.save("c:/pdftest/1704-01-2012-017-C003-025_ASPOSE_9_5_2.pdf");

StackTrace

Exception in thread "main" java.lang.UnsatisfiedLinkError: The specified module could not be found.

    at com.sun.jna.Native.open(Native Method)
    at com.sun.jna.Native.open(Native.java:1759)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:260)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
    at com.sun.jna.Library$Handler.<init>(Library.java:147)
    at com.sun.jna.Native.loadLibrary(Native.java:412)
    at com.sun.jna.Native.loadLibrary(Native.java:391)
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:45)
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:283)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:219)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:200)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:184)
    at test$1.invoke(test.java:222)
    at com.aspose.pdf.internal.p476.z19.m1(Unknown Source)
    at com.aspose.pdf.ADocument.convert(Unknown Source)
    at com.aspose.pdf.Document.convert(Unknown Source)
    at test.main(test.java:215)

I am afraid I am unable to understand your point “You can avoid using Tesseract by making invoke return the contents of the file ‘hocr.txt’.” Can you please share some further details or code snippet that can help us in replicating the issue in our environment?

carlosl · October 30, 2014, 9:27am

Hi, thanks for your reply. The error occurs because Tesseract cannot find its dependencies (DLLs). To avoid Tesseract, you have to modify the invoke method. hocr.txt is attached to my first post. The method FileUtils.readFileToByteArray comes from the Apache commons-io library and is used to load the file into a byte array.
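The UnsatisfiedLinkError in the stack trace is thrown while JNA loads the native Tesseract library, which is consistent with missing DLLs. A minimal sketch for checking which directories actually contain a given DLL is shown here. It assumes the file names libtesseract302.dll and liblept168.dll (check them against your Tess4J distribution) and only looks at the jna.library.path system property and the PATH environment variable; the real search order depends on JNA and the operating system. NativeLibFinder is a hypothetical helper name, not part of Tess4J or JNA.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J or JNA.
class NativeLibFinder {

    // Splits a path list such as "C:\a;C:\b" into directories, skipping empty entries.
    static List<File> splitPath(String pathList) {
        List<File> dirs = new ArrayList<File>();
        if (pathList == null) {
            return dirs;
        }
        for (String entry : pathList.split(File.pathSeparator)) {
            if (!entry.trim().isEmpty()) {
                dirs.add(new File(entry.trim()));
            }
        }
        return dirs;
    }

    // Returns every directory in pathList that contains a file named libFileName.
    static List<File> findIn(String pathList, String libFileName) {
        List<File> hits = new ArrayList<File>();
        for (File dir : splitPath(pathList)) {
            if (new File(dir, libFileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed DLL names; check them against your Tess4J distribution.
        String[] libs = { "libtesseract302.dll", "liblept168.dll" };
        String jnaPath = System.getProperty("jna.library.path");
        String osPath = System.getenv("PATH");
        for (String lib : libs) {
            System.out.println(lib + " in jna.library.path: " + findIn(jnaPath, lib));
            System.out.println(lib + " in PATH: " + findIn(osPath, lib));
        }
    }
}
```

If both lists come back empty, pointing jna.library.path at the folder that holds the DLLs (for example with -Djna.library.path=... on the java command line) is one thing to try. On Windows the same error message can also appear when the DLL is found but one of its own dependencies is missing, so an empty result is a hint rather than a full diagnosis.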

Thanks for your attention

public String invoke(BufferedImage bi) {
    try {
        // The path to hocr.txt goes inside new File(...)
        String hocr = new String(FileUtils.readFileToByteArray(new File()));
        return hocr;
    } catch (Exception exc) {
        exc.printStackTrace();
    }
    return null;
}
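Note that new String(byte[]) decodes with the platform default charset, which can garble accented Spanish characters if the hOCR file is stored as UTF-8 and the default charset is something else. A minimal sketch that reads the file with an explicit charset using only the JDK (Java 7 or later) follows. It assumes the hOCR file is UTF-8 encoded, and HocrFile.read is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; assumes the hOCR file is UTF-8 encoded.
class HocrFile {

    // Reads the whole file into a String, or returns null if it cannot be read,
    // mirroring the null-on-failure behaviour of the invoke method above.
    static String read(String path) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
```

With this in place, the body of invoke could reduce to a single return HocrFile.read(...) call with the location of hocr.txt, and the commons-io dependency is no longer needed.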

codewarior · October 31, 2014, 4:40am

Hi Carlos,

Thanks for sharing the details.

I have tested the scenario using the following code snippet and can confirm that a searchable PDF file is not generated. I have logged the problem in our issue tracking system as PDFNEWJAVA-34536. We will investigate it in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

[Java]

Document doc = new Document("c:/pdftest/1704-01-2012-017-C003-025.pdf");
doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(java.awt.image.BufferedImage bi) {
        try {
            int len;
            char[] chr = new char[4096];
            final StringBuffer buffer = new StringBuffer();
            final FileReader reader = new FileReader("c:/pdftest/hocr.txt");
            try {
                while ((len = reader.read(chr)) > 0) {
                    buffer.append(chr, 0, len);
                }
            } finally {
                reader.close();
            }
            return buffer.toString();
        } catch (Exception exc) {
            exc.printStackTrace();
        }
        return null;
    }
});
doc.save("c:/pdftest/1704-01-2012-017-C003-025_ASPOSE_9_5_2.pdf");

bogdan.kharchykov · February 13, 2015, 11:53am

Hi,

Any update on this?

We have the same issue with PDF.NET.

Regards

Bogdan

tilal.ahmad · February 15, 2015, 11:49pm

Hi Bogdan,

Thanks for your inquiry. Please note that we implemented this feature in Aspose.Pdf for Java 9.7.1. I have tested the shared PDF document with both Aspose.Pdf for Java and Aspose.Pdf for .NET, and it works fine. Please check the following documentation link for details; it should resolve the issue.

Converting non searchable PDF to searchable PDF (Java).

Please feel free to contact us for any further assistance.

Best Regards,

carlosl · March 12, 2015, 3:25pm

Hi, I downloaded Aspose.Pdf for Java 9.7.1 and 10.0.0 to test the changes, with no success.

I’m attaching a snippet of the code I’m using for testing and the PDF I want to make searchable.

To keep things simple, I loaded the contents of the HOCR output into a String and returned it from the invoke method.

Maybe I’m doing something wrong, or the HOCR I’m generating with Tesseract is not compatible.
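One way to narrow down whether the HOCR itself is the problem is to check that it contains word-level elements with bounding boxes before returning it from invoke. A rough sketch follows, assuming the class names ocr_page, ocr_line, ocr_word and ocrx_word and the "bbox x0 y0 x1 y1" title property from the hOCR format. HocrInspector is a hypothetical helper; it uses regular expressions rather than an HTML parser, and which of these elements Aspose.Pdf actually requires is not stated in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for eyeballing hOCR text; regex-based, not a real HTML parser.
class HocrInspector {

    // Matches "bbox x0 y0 x1 y1" as it appears inside a title attribute.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    // Counts elements whose class attribute is exactly the given hOCR class name.
    static int countClass(String hocr, String className) {
        Pattern p = Pattern.compile(
                "class\\s*=\\s*['\"]" + Pattern.quote(className) + "['\"]");
        Matcher m = p.matcher(hocr);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Extracts every bounding box in document order as {x0, y0, x1, y1}.
    static List<int[]> boundingBoxes(String hocr) {
        List<int[]> boxes = new ArrayList<int[]>();
        Matcher m = BBOX.matcher(hocr);
        while (m.find()) {
            boxes.add(new int[] {
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)) });
        }
        return boxes;
    }
}
```

Calling countClass(hocr, "ocrx_word") and countClass(hocr, "ocr_word") on the Tesseract output shows which word-level markup it uses, and a zero from boundingBoxes(hocr).size() would mean there are no coordinates to position text with. Comparing these numbers against an HOCR sample that is known to work would show whether the structure differs.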

Thanks for your attention

tilal.ahmad · March 15, 2015, 11:47pm

Hi Carlos,

Thanks for your feedback. It seems you have posted your query twice, so please check our response on your other post.

Best Regards,