Free Support Forum - aspose.com

Error making PDF searchable when using Aspose with Tesseract

Following the example provided in the Aspose documentation, I'm making a searchable PDF with Aspose and Tesseract, but on some computers my app has problems rendering accented characters.

The charset on the PC that has trouble rendering accented characters is windows-1252,
and on the computer where it works it is UTF-8.
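
To compare the two machines, it can help to print the charset the JVM falls back to when code does not name one. The following is a minimal sketch; the class and method names (`DefaultCharsetCheck`, `defaultCharsetName`) are made up for illustration.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    /** Returns the name of the charset the JVM uses when none is given explicitly. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // On the two machines described above, this would be expected to differ.
        System.out.println("Default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Classes such as `Scanner` and `FileReader` use this default when constructed without a charset argument. On JDK 18 and later the default is UTF-8 unless overridden, so the difference mainly shows up on older JVMs.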

@mike1986,

Please send all details of the scenario, including source PDF, code and call stack of the error. We will investigate your scenario in our environment, and share our findings with you.

            final Path temp= Files.createTempDirectory("testAspose"+Long.toString(System.nanoTime()));
            Document.CallBackGetHocr cbgh = new Document.CallBackGetHocr() {
                @Override
                public String invoke(java.awt.image.BufferedImage img){
                    File outputfile = new File(temp +"/" + "test.jpg");
                    try {
                        ImageIO.write(img, "jpg", outputfile);
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                    try {
                        java.lang.Process process = Runtime.getRuntime().exec("tesseract" + " " + temp +"/" + "test.jpg" + " " + temp +"/out.html" + " hocr -l fra+eng");
                        //System.out.println("tesseract" + " " + temp +"/" + "test.jpg" + " " + temp +"/" + " hocr");
                        process.waitFor();

                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    File file = new File(temp +"/" + "out.html.hocr");
                    StringBuilder fileContents = new StringBuilder((int) file.length());
                    Scanner scanner = null;
                    try {
                        scanner = new Scanner(file);
                        String lineSeparator = System.getProperty("line.separator");

                        while (scanner.hasNextLine()) {
                            fileContents.append(scanner.nextLine() + lineSeparator);
                        }
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } finally {
                        if (scanner != null)
                            scanner.close();
                    }
                    return fileContents.toString();
                }
            };
            try {
                doc.convert(cbgh);
                doc.save(file.getPath());
            }
            catch (Exception e)
            {
                System.out.println("error");
                e.printStackTrace();
            }

So no error is returned, but when I search the PDF produced by this code, accented characters are replaced by garbled symbols, while the other characters are fine. I will send you the PDF in a private message.
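
This symptom is consistent with UTF-8 bytes being decoded as windows-1252: ASCII characters survive, while each accented character turns into two unrelated symbols. The sketch below reproduces that effect in isolation. It assumes the hOCR file is UTF-8 encoded, and `MojibakeDemo` and `misdecode` are hypothetical names, not part of Aspose or Tesseract.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with a different charset. */
    static String misdecode(String text, String wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute; its UTF-8 bytes are C3 A9, which windows-1252
        // reads as two characters: U+00C3 (A-tilde) and U+00A9 (copyright sign).
        String garbled = misdecode("\u00e9", "windows-1252");
        System.out.println(garbled.length());                        // 2
        System.out.println(garbled.equals("\u00c3\u00a9"));          // true
        // Plain ASCII is identical in both charsets, so it is unaffected.
        System.out.println(misdecode("pdf", "windows-1252").equals("pdf")); // true
    }
}
```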

@mike1986,

We have tested your scenario with the latest version 18.5 of Aspose.PDF for Java API, and the output PDF looks fine. This is the output PDF: Output.pdf (140.8 KB). Please review and let us know how that goes into your environment.

Thank you. I have found a solution: adding -Dfile.encoding=UTF-8 to the Java options list.
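
An alternative that does not depend on JVM options is to name the charset when reading the hOCR file. The following is a minimal sketch, assuming Tesseract writes its hOCR output as UTF-8; `HocrReader`, `readUtf8` and `roundTrip` are hypothetical helper names used only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HocrReader {
    /** Reads the whole file as UTF-8, independent of the platform default charset. */
    static String readUtf8(Path hocrFile) {
        try {
            return new String(Files.readAllBytes(hocrFile), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demonstration only: writes a sample as UTF-8 to a temp file and reads it back. */
    static String roundTrip(String sample) {
        try {
            Path tmp = Files.createTempFile("hocr-demo", ".hocr");
            try {
                Files.write(tmp, sample.getBytes(StandardCharsets.UTF_8));
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "<span class='ocrx_word'>\u00e9t\u00e9</span>";
        System.out.println(roundTrip(sample).equals(sample)); // true
    }
}
```

In the callback posted earlier, the smallest equivalent change would be to construct the scanner as `new Scanner(file, "UTF-8")` rather than `new Scanner(file)`, so that the file is decoded the same way on every machine.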

@mike1986,

It is nice to hear from you that the problem has been resolved. Please feel free to let us know whenever you need assistance.