Issues in Aspose.PDF CallBackGetHocr

vmerz · July 4, 2018, 9:05am

Dear Aspose Team,
we’ve got a very big issue with hundreds of documents when using the Aspose.PDF CallBackGetHocr mechanism.
You can find a very simple sample document attached to this post.
What I can tell you about the faulty process is this:
• Although there are 4 images in the pdf file, the callback was triggered only 3 times
• The hocr result (html) was assigned to the wrong image. The text of the “tall text container” was connected to the lowest “more text” image.
You can see the effect best by opening the processed document in a PDF viewer and trying to select the text behind the lowest “more text” box.
The same issue was reported and logged as PDFJAVA-36669. We are using Aspose.PDF version 17.6, but the problem still exists (with other documents).
That leads to the next question: do you need all of the affected documents to fix the problem, or is there a chance to get this fixed in a more general manner?

Kind regards

example.pdf (86.2 KB)
result.pdf (89.2 KB)

imran.rafique · July 5, 2018, 7:48am

@vmerz,

We have tested your scenario with the latest version 18.6.1 of Aspose.PDF for .NET API and Tesseract 4.0. The output PDF looks fine. This is the output PDF: Output.pdf (89.9 KB). If this does not help, then please send us the complete code. Your response is awaited.

imran.rafique · July 5, 2018, 8:07am

@vmerz,

In addition to the above reply, we have also tested with the latest version 18.6 of Aspose.PDF for Java API and Tesseract 4.0. This is the output PDF (the callback is triggered 4 times): outputJava.pdf (89.9 KB)

Java

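// JDK imports used by this snippet. Document and CallBackGetHocr come from the Aspose.PDF for Java library.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import javax.imageio.ImageIO;

String myDir = "C:\\Pdf\\test937\\";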
Document doc = new Document(myDir + "example.pdf");
// Create the callback that recognizes text in the PDF's images. It calls an external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR).
// We have used the free Google Tesseract OCR engine (http://en.wikipedia.org/wiki/Tesseract_%28software%29)
CallBackGetHocr cbgh = new CallBackGetHocr() {
	@Override
	public String invoke(java.awt.image.BufferedImage img) {
		File outputfile = new File(myDir + "test.jpg");
		try {
			ImageIO.write(img, "jpg", outputfile);
		} catch (IOException e1) {
			e1.printStackTrace();
		}

		try {
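			// Pass the arguments as an array so that paths containing spaces are not split into several arguments
			java.lang.Process process = Runtime.getRuntime().exec(new String[] { "C:\\Program Files (x86)\\Tesseract-OCR\\tesseract", myDir + "test.jpg", myDir + "out", "hocr" });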
			System.out.println("tesseract" + " " + myDir + "test.jpg" + " " + myDir + "out hocr");
			process.waitFor();

		} catch (IOException e) {
			e.printStackTrace();
		} catch (InterruptedException e) {
			e.printStackTrace();
		}

		// reading out.hocr to string
		File file = new File(myDir + "out.hocr");
		StringBuilder fileContents = new StringBuilder((int) file.length());
		Scanner scanner = null;
		try {
			scanner = new Scanner(file);
			String lineSeparator = System.getProperty("line.separator");

			while (scanner.hasNextLine()) {
				fileContents.append(scanner.nextLine() + lineSeparator);
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} finally {
			if (scanner != null)
				scanner.close();
		}

		// deleting temp files
		File fileOut = new File(myDir + "out.hocr");
		if (fileOut.exists()) {
			fileOut.delete();
		}
		File fileTest = new File(myDir + "test.jpg");
		if (fileTest.exists()) {
			fileTest.delete();
		}

		return fileContents.toString();
	}
};
// End callBack

doc.convert(cbgh);
doc.save(myDir + "outputJava.pdf");
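
The callback above writes to fixed file names (test.jpg, out.hocr) in a shared folder and passes the Tesseract path through the command line. As a minimal sketch of a more defensive variant, the helper below keeps every command-line argument separate and works in a fresh temporary directory. The class name `HocrRunner` and its methods are hypothetical and are not part of Aspose.PDF or Tesseract; it also assumes that the installed Tesseract writes `<outputBase>.hocr`, as the code above expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Aspose.PDF or Tesseract.
final class HocrRunner {

    // Builds the argument list for "tesseract <image> <outputBase> hocr".
    // Each argument is a separate list element, so paths with spaces stay intact.
    static List<String> buildCommand(String tesseractExe, Path image, Path outputBase) {
        return Arrays.asList(tesseractExe, image.toString(), outputBase.toString(), "hocr");
    }

    // Runs Tesseract on the given image and returns the hOCR text.
    // Assumption: Tesseract writes its result to <outputBase>.hocr.
    static String recognize(String tesseractExe, Path image)
            throws IOException, InterruptedException {
        // A fresh directory per call avoids clashes between invocations.
        Path workDir = Files.createTempDirectory("hocr");
        Path outputBase = workDir.resolve("out");
        Path hocrFile = workDir.resolve("out.hocr");
        try {
            Process process = new ProcessBuilder(buildCommand(tesseractExe, image, outputBase))
                    .inheritIO() // show Tesseract's own messages on the console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(hocrFile), StandardCharsets.UTF_8);
        } finally {
            // Remove the temporary output file and then the (now empty) directory.
            Files.deleteIfExists(hocrFile);
            Files.deleteIfExists(workDir);
        }
    }
}
```

Inside the callback, the image would still need to be written to a file first (for example with `ImageIO.write`, as above) before its path is handed to `HocrRunner.recognize`.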

vmerz · July 5, 2018, 10:41am

@imran.rafique

Thank you for this information. It’s nice to know that the current version should probably work.
But how can we get this in our version? Is there a bugfix for 17.6?
To make things clear: I submitted my request to fix this bug more than a year ago. I don’t want to pay for every bug in your software.

imran.rafique · July 5, 2018, 8:41pm

@vmerz,

We do not provide fixes for old versions, and we recommend that our clients always try the latest version of the Aspose.PDF API.

vmerz · July 6, 2018, 8:51am

@imran.rafique

Okay, I didn’t know that, but that’s also not a problem. So you only need to provide a new license file so I can use the new version. I’m also fine with this procedure.
Our subscription expired in January 2018, but I reported the bug (PDFJAVA-36669) in April 2017. So there should not be much trouble for us.
Please send me the valid license file.

imran.rafique · July 6, 2018, 1:01pm

@vmerz,

We have posted your query regarding the new license in the Aspose.Purchase forum. Please refer to this forum thread: https://forum.aspose.com/t/require-the-new-license/179356

vmerz · July 6, 2018, 2:22pm

@GeorgeClark & @imran.rafique
So is this the approach of Aspose? Just don’t fix bugs for months until the user’s subscription has ended and then force them to pay thousands of dollars again? [ironic]Stunning is an understatement![/ironic]
This was a BUG, not a feature.

asad.ali · July 6, 2018, 10:38pm

@vmerz

Thanks for writing to us.

We apologize if any of our replies made you unhappy. Please note that issues are sometimes related to a specific document and are resolved for that specific document only. The earlier logged issue PDFJAVA-36669 was reported for a specific PDF document and was resolved for that document in the Aspose.PDF for Java 17.5 release.

You are now facing a similar issue with different PDF documents. After testing the scenario with the latest API version, we found that the issue does not occur there, which is why we suggested that you upgrade to the latest release of the API. Please note that issues reported against older versions of the API are fixed in the latest versions.

Had we been able to replicate the issue with Aspose.PDF for Java 18.6, we would have logged it in our issue tracking system, and it would have been resolved in a later version of the API. Even in that case, you would still need to upgrade to the latest release to get the fix.

We apologize again that you were left with such an impression of our support. Kindly note that issues logged under the free support model are treated with low priority and resolved on a first come, first served basis. Hence, resolving an issue may take months, depending on how many issues were logged before it. Under the Paid Support model, by contrast, we resolve issues on an urgent basis, and those issues take precedence over issues logged under the free support model.

We request that you upgrade your API to version 18.6. In case you still face any issue, please feel free to let us know.