Planned tesseract replacement by Aspose?

vmerz · February 27, 2017, 1:57am

Hello support-team,

To convert a non-searchable PDF into a searchable PDF, I'm forced to use Tesseract to create the hOCR data.
Will Aspose offer a solution in the future that does not require third-party tools?

Thanks in advance for your answer!
Regards,
Vincent Merz

tilal.ahmad · February 27, 2017, 9:22pm

Hi Vincent,

Thanks for your inquiry. We have a component, Aspose.OCR, for text recognition. However, I am afraid it does not currently support creating searchable PDFs, as Aspose.OCR is not yet mature enough. We are facing some issues with text recognition accuracy and with the coordinates of the recognized text. Our development team is working to fix these issues and is investigating new algorithms for the purpose.

However, please share your sample source document here. We will test it with Aspose.OCR and update you accordingly.

We are sorry for the inconvenience.

Best Regards,

vmerz · March 15, 2017, 4:39am

Hello Aspose Team,

Thanks for your rapid response.

So you plan to add functionality to Aspose.OCR that converts an image to a PDF (same appearance, but with searchable text). Did I get that right?

Can you make any prediction regarding a release date for that?

Regards,

Vincent


tilal.ahmad · March 16, 2017, 12:35am

Hi Vincent,

Thanks for your feedback. Please note that Aspose.OCR does not convert an image to PDF; it only extracts the text from the image. The OCR text can then be used with Aspose.Pdf to create a searchable PDF document.

However, as described above, Aspose.OCR is currently not mature enough. Our Aspose.OCR product team is working to improve the API's performance and accuracy. Please share your sample image here so that we can test the scenario and guide you accordingly.

Best Regards,

vmerz · March 17, 2017, 7:59am

Hi Tilal Ahmad,

I've attached an example file and the result file. As you can see, source.pdf contains four letters, one in each corner:
A B
C D

Now I try to do OCR using the hOCR callback mechanism (CallBackGetHocr). I get a callback for each letter, and the generated hOCR result seems correct to me, even though B was not recognized.
The third attachment contains the hOCR results.

When I now open the converted PDF file, there is a mismatch between the images and the related hOCR results. The text boxes contain the following letters (_ == no box):

_ C
D A

The hOCR result is not attached correctly to the image, because the letter boxes are in the wrong order (just copy and paste the text from result.pdf to see this).
I have tons of documents with this behaviour, but let's start with this simple example.

Do you have any idea what is going wrong? I'm looking forward to a bugfix.

Kind regards,
Vincent Merz
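
A quick way to check whether the hOCR data itself has the boxes in the expected positions, before it is handed back from the callback, is to list each recognized word with its bounding box. The following is only a minimal sketch: the class and method names are hypothetical, it is not part of Aspose.Pdf or Tesseract, and it assumes Tesseract-style markup in which each word is a `span` with class `ocrx_word` whose `title` attribute starts with `bbox x0 y0 x1 y1` (pixel coordinates, origin at the top-left of the image) and in which `class` appears before `title`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative hOCR inspection helper (hypothetical; not an Aspose or Tesseract API). */
final class HocrWords {

    /** One recognized word and its pixel bounding box (origin: top-left of the image). */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        @Override
        public String toString() {
            return text + " @ (" + x0 + "," + y0 + ")-(" + x1 + "," + y1 + ")";
        }
    }

    // Assumes the class attribute precedes the title attribute inside the span tag.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /** Extracts all word boxes from an hOCR string, in document order. */
    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline markup such as <strong> or <em> around the word text.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "<span class='ocrx_word' id='word_1_1' "
                + "title='bbox 36 92 96 116; x_wconf 90'>A</span>";
        for (Word w : parse(sample)) {
            System.out.println(w);
        }
    }
}
```

If the boxes printed for each callback image already match the letter positions in the scan, the hOCR input is likely fine and the mismatch is introduced when the text layer is attached to the page.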


ikram.haq · March 20, 2017, 5:14am

Hi Vincent,

Thank you for your inquiry and sharing samples.

This is to update you that we have investigated the issue on our end. We tried to read the contents of the supplied PDFs and found that Aspose.OCR is able to read the text from the attached sample PDFs successfully. A sample code snippet is given below for your reference.

CODE:

vmerz · March 21, 2017, 4:58am

Hi everyone,

Extracting the text from a PDF has never been the problem. The problem is creating a searchable PDF from a non-searchable one, and especially the position of the text passages in the new document.

As far as I can see, there was an improvement from Aspose.Pdf 17.1 to 17.2. Still, there is a problem with the CallBackGetHocr mechanism in combination with some scans.

Once again the problem:

The source document is a PDF/A scan from a Xerox WorkCentre 5755 (source.pdf).
After processing the document with your CallBackGetHocr mechanism, target.pdf is created.

When you now try to select the text, the selection is misplaced, as you can see in the attached image (misplaced.png).

Do you have any idea why this happens?
Regards,
Vincent
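
For context, one general source of misplaced text layers is the difference between the two coordinate systems involved: hOCR boxes are given in image pixels with the origin at the top-left and y growing downward, while PDF user space is measured in points (1/72 inch) with the origin at the bottom-left and y growing upward. The sketch below shows that conversion only as an illustration. It is not confirmed to be the cause of the problem in this thread, the class and method names are hypothetical, and it assumes the scanned image fills the whole page, is not rotated, and has a known resolution in DPI.

```java
/** Illustrative pixel-to-PDF-point conversion (hypothetical helper, not an Aspose API). */
final class HocrToPdfCoords {

    /**
     * Converts an hOCR bounding box (pixels, top-left origin) into a PDF rectangle
     * {llx, lly, urx, ury} in points with a bottom-left origin.
     * Assumes the image covers the full page and is not rotated.
     */
    static double[] toPdfRect(int x0, int y0, int x1, int y1, int imageHeightPx, double dpi) {
        double scale = 72.0 / dpi;                       // points per pixel
        double llx = x0 * scale;
        double lly = (imageHeightPx - y1) * scale;       // flip y: bottom edge of the box
        double urx = x1 * scale;
        double ury = (imageHeightPx - y0) * scale;       // flip y: top edge of the box
        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A box near the top-left of a 300 dpi scan that is 3300 pixels (11 inches) high.
        double[] r = toPdfRect(300, 300, 600, 450, 3300, 300.0);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

Comparing rectangles computed this way with the positions at which the text can actually be selected in the output file helps to tell a scaling or flipping error apart from boxes that are assigned to the wrong image.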

codewarior · March 22, 2017, 6:02am

Hi Vincent,

Thanks for sharing the details.

We are working on testing the scenario and will get back to you soon.

codewarior · April 7, 2017, 11:21am

Hi Vincent,

Thanks for your patience.

I have tested the scenario using the following code snippet with Aspose.Pdf for Java 17.3.0, and I am unable to notice any issue in the resultant file. Could you please share some more details so that we can look into this matter further?

For your reference, I have attached the output generated on my end.

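// Imports needed by this snippet
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.imageio.ImageIO;

final String myDir = "c:/pdftest/";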
com.aspose.pdf.Document doc = new com.aspose.pdf.Document(myDir + "source (1).pdf");

// Create callBack - logic to recognize text for PDF images. Use an external OCR supporting the HOCR standard (http://en.wikipedia.org/wiki/HOCR).
// We are using the free Google Tesseract OCR (http://en.wikipedia.org/wiki/Tesseract_%28software%29).
com.aspose.pdf.Document.CallBackGetHocr cbgh = new com.aspose.pdf.Document.CallBackGetHocr() {
    public String invoke(java.awt.image.BufferedImage img) {
        File outputfile = new File(myDir + "test.jpg");
        try {
            ImageIO.write(img, "jpg", outputfile);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        try {
            java.lang.Process process = Runtime.getRuntime()
                    .exec("tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            System.out.println("tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
            process.waitFor();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

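        // Read the hOCR output (out.html) into a string.
        // Note: depending on the Tesseract version, the file may be written as out.hocr instead.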
        File file = new File(myDir + "out.html");
        StringBuilder fileContents = new StringBuilder((int) file.length());
        java.util.Scanner scanner = null;
        try {
            scanner = new java.util.Scanner(file);
            String lineSeparator = System.getProperty("line.separator");
            while (scanner.hasNextLine()) {
                fileContents.append(scanner.nextLine() + lineSeparator);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (scanner != null)
                scanner.close();
        }

        // Deleting temporary files
        File fileOut = new File(myDir + "out.html");
        if (fileOut.exists()) {
            fileOut.delete();
        }

        File fileTest = new File(myDir + "test.jpg");
        if (fileTest.exists()) {
            fileTest.delete();
        }

        return fileContents.toString();
    }
};

// End callBack
doc.convert(cbgh);
doc.save(myDir + "output971.pdf");
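
The external Tesseract call inside the callback above can also be written in a somewhat more defensive way. The following is a sketch only, with hypothetical class and method names: it assumes the `tesseract` executable is on the PATH and uses the same `tesseract <image> <outputbase> hocr` invocation as the snippet above. It passes the arguments as a list, so paths containing spaces are not split, checks the exit code, and looks for both `out.hocr` and `out.html`, since the extension of the hOCR file depends on the Tesseract version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Illustrative wrapper around the tesseract CLI (hypothetical helper, not an Aspose API). */
final class TesseractHocr {

    /** Builds the argument list for: tesseract <image> <outputBase> hocr */
    static List<String> command(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    /** Returns the hOCR file written for outputBase (.hocr or .html), or null if none exists. */
    static Path findOutput(Path outputBase) {
        for (String ext : new String[] {".hocr", ".html"}) {
            Path candidate = outputBase.resolveSibling(outputBase.getFileName() + ext);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /** Runs Tesseract on the image, returns the hOCR text, and deletes the output file. */
    static String run(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase))
                .inheritIO()   // show Tesseract's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        Path output = findOutput(outputBase);
        if (output == null) {
            throw new IOException("no hOCR output found for " + outputBase);
        }
        try {
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

Inside the callback, `run(...)` would take the place of the `Runtime.exec` call and the `Scanner` loop. Using a unique temporary file name for each image, rather than a fixed `test.jpg`, would additionally avoid clashes if several documents are converted at the same time.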

vmerz · April 11, 2017, 8:58am

Hi Nayyer,

Thank you for your answer. To reproduce my issue, just open your generated PDF file and try to select the text. You will see that you cannot select the text in the right place. An example screenshot is attached.

Regards,
Vincent

codewarior · April 12, 2017, 8:23am

Hi Vincent,

Thanks for using our APIs.

I have tested the scenario and managed to reproduce the same problem. I have logged it as PDFJAVA-36669 in our issue tracking system. We will look further into the details of this problem and keep you posted on the status of the fix. Please be patient and allow us a little time. We are sorry for the inconvenience.

aspose.notifier · June 8, 2017, 8:08am

The issues you have found earlier (filed as PDFJAVA-36669) have been fixed in Aspose.Pdf for Java 17.5.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.