Searchable PDF overlay with Aspose.PDF and Tesseract

leszek.wisniewski · December 1, 2015, 9:56am

Dear support,

I would like to render a searchable text layer over an existing PDF document. I have read your tutorial on using Tesseract OCR to extract text from an image. What blocks me from running the example attached to issue PDFNEWJAVA-33678 is an error that I get from Aspose:

tesseract src/main/resources/test-files/test.jpg src/main/resources/test-files/out hocr

Exception in thread "main" class com.aspose.pdf.internal.ms.System.z75: Unknown char: ;

com.aspose.pdf.internal.p584.z6.m1(Unknown Source)

com.aspose.pdf.internal.ms.System.z61.m1(Unknown Source)

com.aspose.pdf.internal.ms.System.z43.m12(Unknown Source)

com.aspose.pdf.internal.p88.z5.m1(Unknown Source)

com.aspose.pdf.ADocument.convert(Unknown Source)

com.aspose.pdf.Document.convert(Unknown Source)

ch.mimacom.zurich.AsposePDFScanner.scanText(AsposePDFScanner.java:80)

ch.mimacom.zurich.Starter.main(Starter.java:10)

sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

java.lang.reflect.Method.invoke(Method.java:497)

com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

And an excerpt from the HTML generated by Tesseract:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Roland und Stephan Aebi, p.Adr. Aspa AG, Amerbachslrasse 72, 4007 Basel

I am using Tesseract 3.04 & Aspose PDF 10.9.0. I am evaluating available OCR tools for my company for purchase, so any help with this is greatly appreciated.
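For anyone reproducing this, the Tesseract call quoted above can be sketched in Java roughly as follows. This is only a minimal illustration: the class name HocrCommand and its methods are hypothetical, it assumes the tesseract binary is on the PATH, and it does not involve Aspose.PDF at all. It only builds and launches the same command line that produces the hOCR output.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF or Tesseract.
public class HocrCommand {

    // Builds the same argument list as the command line quoted in the post:
    // tesseract <image> <output base name> hocr
    static List<String> build(String imagePath, String outputBase) {
        return Arrays.asList("tesseract", imagePath, outputBase, "hocr");
    }

    // Launches Tesseract and returns its exit code.
    // Assumes the tesseract executable can be found on the PATH.
    static int run(String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(build(imagePath, outputBase))
                .inheritIO() // forward Tesseract's own messages to the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command here; call run(...) to actually execute it.
        List<String> cmd = build(
                "src/main/resources/test-files/test.jpg",
                "src/main/resources/test-files/out");
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems when the image path contains spaces.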

tilal.ahmad · December 2, 2015, 2:37am

Hi Leszek,

Thanks for your inquiry. We would appreciate it if you could share your source PDF document here, so that we can test the scenario at our end and guide you accordingly.

We are sorry for the inconvenience caused.

Best Regards,

leszek.wisniewski · January 12, 2016, 7:08am

Dear support,

I would like to attach the test file that caused this error. Please take a look.

Vertrage_merged.pdf (shared via Dropbox)

Best regards,

Leszek Wisniewski

tilal.ahmad · January 13, 2016, 2:10am

Hi Leszek,

Thanks for sharing your source document. I have tested the scenario and reproduced the issue, so I have logged a ticket, PDFNEWJAVA-35431, in our issue tracking system for further investigation and resolution. We will keep you updated on the progress of the issue resolution.

We are sorry for the inconvenience caused.

Best Regards,

leszek.wisniewski · January 13, 2016, 2:23am

Dear support,

Thank you for looking into this issue. My company is already an active customer and has a license for Aspose.PDF. As I would very much like to reuse the OCR functionality embedded in that module, are you able to say when the fix for the issue could be produced?

Thank you!

Best regards,

Leszek Wisniewski

tilal.ahmad · January 13, 2016, 10:32pm

Hi Leszek,

Thanks for your feedback. I am afraid the issue was only logged recently and is pending investigation in the queue with other reported issues. We cannot share an ETA at the moment, but we will be in a good position to do so as soon as our product team completes the initial investigation. We will notify you as soon as we have made significant progress towards resolving the issue.

We are sorry for the inconvenience caused.

Best Regards,

aspose.notifier · May 10, 2016, 2:13pm

The issues you have found earlier (filed as PDFNEWJAVA-35431) have been fixed in Aspose.Pdf for Java 11.5.0.
