Error making PDF searchable when using Aspose with Tesseract

mike1986 · June 20, 2018, 2:51pm

Following the example provided in the Aspose documentation, I'm making a searchable PDF with Aspose and Tesseract, but when I run my app on some computers, accented characters are not rendered correctly.

The charset on the PC that has trouble rendering accented characters is windows-1252,
and on the computer where it works it is UTF-8.

imran.rafique · June 20, 2018, 7:21pm

@mike1986,

Please send all details of the scenario, including source PDF, code and call stack of the error. We will investigate your scenario in our environment, and share our findings with you.

mike1986 · June 21, 2018, 7:12am

            final Path temp= Files.createTempDirectory("testAspose"+Long.toString(System.nanoTime()));
            Document.CallBackGetHocr cbgh = new Document.CallBackGetHocr() {
                @Override
                public String invoke(java.awt.image.BufferedImage img){
                    File outputfile = new File(temp +"/" + "test.jpg");
                    try {
                        ImageIO.write(img, "jpg", outputfile);
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                    try {
                        java.lang.Process process = Runtime.getRuntime().exec("tesseract" + " " + temp +"/" + "test.jpg" + " " + temp +"/out.html" + " hocr -l fra+eng");
                        //System.out.println("tesseract" + " " + temp +"/" + "test.jpg" + " " + temp +"/" + " hocr");
                        process.waitFor();

                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    File file = new File(temp +"/" + "out.html.hocr");
                    StringBuilder fileContents = new StringBuilder((int) file.length());
                    Scanner scanner = null;
                    try {
                        scanner = new Scanner(file);
                        String lineSeparator = System.getProperty("line.separator");

                        while (scanner.hasNextLine()) {
                            fileContents.append(scanner.nextLine() + lineSeparator);
                        }
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } finally {
                        if (scanner != null)
                            scanner.close();
                    }
                    return fileContents.toString();
                }
            };
            try {
                doc.convert(cbgh);
                doc.save(file.getPath());
            }
            catch (Exception e)
            {
                System.out.println("error");
                e.printStackTrace();
            }

So no error is returned, but when I search in the PDF produced by this code, accented characters are replaced by garbled symbols. Other characters are fine. I will send you the PDF in a private message.

imran.rafique · June 21, 2018, 4:05pm

@mike1986,

We have tested your scenario with the latest version, 18.5, of the Aspose.PDF for Java API, and the output PDF looks fine. This is the output PDF: Output.pdf (140.8 KB). Please review it and let us know how it works in your environment.

mike1986 · June 22, 2018, 8:14am

Thank you. I have found a solution: adding -Dfile.encoding=UTF-8 to the Java options list.
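A possible explanation, and an alternative that does not depend on JVM options: the code above reads the hOCR file with new Scanner(file), which decodes with the JVM default charset. Tesseract writes hOCR as UTF-8, so on a machine whose default is windows-1252 each accented character is decoded as two wrong characters. The sketch below reads the file with an explicit charset instead. The class and method names (HocrReader, decodeHocr, readHocr) are hypothetical and only for illustration; this is not taken from the Aspose example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decodes Tesseract's hOCR output as UTF-8,
// independent of the JVM default charset (file.encoding).
public class HocrReader {

    // Decode raw hOCR bytes as UTF-8.
    static String decodeHocr(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Read the whole hOCR file and decode it as UTF-8.
    static String readHocr(Path hocrFile) {
        try {
            return decodeHocr(Files.readAllBytes(hocrFile));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a UTF-8 hOCR fragment containing accented characters.
        String original = "<span class='ocrx_word'>r\u00e9sum\u00e9</span>";
        Path tmp = Files.createTempFile("sample", ".hocr");
        Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));

        // Explicit UTF-8: accented characters survive the round trip.
        String good = readHocr(tmp);

        // Decoding the same bytes as windows-1252 reproduces the garbling.
        String bad = new String(Files.readAllBytes(tmp), Charset.forName("windows-1252"));

        System.out.println("UTF-8 read matches original:        " + good.equals(original));
        System.out.println("windows-1252 read matches original: " + bad.equals(original));

        Files.delete(tmp);
    }
}
```

In the callback, the Scanner loop could then be replaced with return HocrReader.readHocr(file.toPath()); or, with a smaller change, the Scanner could be created as new Scanner(file, "UTF-8").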

imran.rafique · June 22, 2018, 5:20pm

@mike1986,

It is good to hear that the problem has been resolved. Please feel free to let us know whenever you need assistance.